Abstract

We develop subgradient- and gradient-based methods for minimizing strongly convex functions
under a notion which generalizes the standard Euclidean strong convexity. We propose
a unifying framework for subgradient methods which yields two kinds of methods, namely, the 
Proximal Gradient Method (PGM) and the Conditional Gradient Method (CGM), unifying 
several existing methods. The unifying framework provides tools to analyze the convergence 
of PGMs and CGMs for non-smooth, (weakly) smooth, and further for structured problems 
such as the inexact oracle models. The proposed subgradient methods yield optimal PGMs for 
several classes of problems and yield optimal and nearly optimal CGMs for smooth and weakly 
smooth problems, respectively. 

Keywords: non-smooth/smooth convex optimization, structured convex optimization,
subgradient/gradient-based proximal method, conditional gradient method, complexity theory,
strongly convex functions, weakly smooth functions. 
1 Introduction

Subgradient- and gradient-based methods for convex optimization have been actively investigated 
in the last decades, providing efficient solutions for large-scale optimization problems which arise 
from image/signal processing, data mining, statistics, etc. The efficiency of (sub)gradient-based
methods is often analyzed from the viewpoint of oracle complexity [32], [34] to ensure a given
absolute accuracy $\varepsilon > 0$ for the optimal value, and so far various “optimal” methods are known
for several classes of problems. Achieving the optimal complexity for subgradient methods usually
requires a priori problem-specific information; sometimes, however, we can attain optimal or nearly
optimal complexity with fewer such requirements (although we may need some restrictions on their
implementations).

The following two classes of convex problems have been particularly well studied: 

• Non-smooth problems. The problems of minimizing Lipschitz continuous convex functions 
with bounded subgradients; 
• Smooth problems. The problems of minimizing continuously differentiable convex functions
with Lipschitz continuous gradients. 

These two classes of convex problems can also be reformulated as structured convex problems, 
which have been receiving much attention in terms of both theoretical and application aspects. In 
particular, studies of (sub)gradient-based methods for the class of “smoothable” functions [H |6l 
[911271 1351136], the class of composite problems [Ill5l[8llI71ll8llia|26l|38lll21[l3|, and the class of 
weakly smooth problems uniiaiMiiin] are notably important. 

In this paper, we particularly focus on the following two kinds of (sub)gradient methods: the
Proximal (sub)Gradient Method (PGM) and the Conditional Gradient Method (CGM). Both methods
may require easy-to-solve subproblems at each iteration.

The PGM is executed using a prox-function to define a reasonable proximal operator. Based
on the conceptual complexity of Nemirovski and Yudin [32], many important PGMs for the above
classes of convex problems can be proposed and their optimal convergence can be achieved. As
will be pointed out in this paper, many PGMs are modifications, accelerations, and/or combinations
of two remarkably important PGMs, namely, the Mirror-Descent Method (MDM) [H [32]
and the Dual-Averaging Method (DAM) [37], which are optimal for non-smooth problems.

The CGMs, on the other hand, rely on subproblems which are linear, i.e., problems
of minimizing a linear functional over a bounded convex feasible set. Originating from Frank
and Wolfe [15], convergence properties of CGMs are well analyzed (see [TOl [TS] [T6l [27] [l0[ [TT]
and references therein). Because of their advantages, such as the simplicity of the subproblems and
the sparsity of approximate solutions, CGMs are actively studied with applications to machine learning and
statistics [3 [m [231 [21]. It is important to note that the CGMs have worse convergence rates than
the PGMs, but the computational cost of each iteration of the former can be lower, which can
compensate in the overall cost. Therefore, it is extremely important to choose between the PGM and the CGM
depending on the structure of the problem to solve.

In a recent work [22], a unifying framework of PGMs was proposed through a unifying treatment
of the MDM and the DAM for non-smooth problems, and also for their corresponding accelerations
[421 [l3] for smooth (and structured) problems. This unifying framework enables one to
generate a family of (optimal) subgradient methods which includes several existing optimal methods.
It also permits analyzing both the classical PGMs (i.e., the MDM and the DAM) for non-smooth
problems and their accelerations for smooth problems under the same framework, whereas
the existing analyses of them were performed individually. It is important to observe that, if we do not
restrict the discussion to the MDM and the DAM, other universal methods with optimal complexity
were previously proposed for both non-smooth and smooth problems as well [IIl[l2l[l8lll9ll26l[39].

The work [22], however, focused only on PGMs and was developed without assuming the strong
convexity of objective functions. Using the knowledge of strong convexity can help us to obtain
a much faster rate of convergence. For instance, the MDM [3[ [^ [30] for non-smooth problems and
Nesterov’s PGMs [34], [38] for smooth (or composite) problems realize the optimal complexity in
the strongly convex cases. Moreover, exploiting multistage procedures is a powerful approach to
obtain optimal PGMs [3 [HI [25] [ST] [33l |38]. However, the multistage procedures require a priori
knowledge of an upper bound of the distance between the initial point and the optimal solution set.
Note that the optimal complexity of the DAM for non-smooth problems and of Tseng’s PGM
for smooth problems is not known without the multistage procedure (see Sections 2.2 and 2.3.2).

This paper proposes a new unifying framework of PGMs and CGMs for convex problems with
strongly convex objective functions and its convergence analysis for both non-smooth and smooth
problems. The smooth problems become particular cases of structured problems by employing the
generalized notion of the inexact oracle model mm. It also enables us to handle simultaneously
the weakly smooth problems. The proposed methods require a priori knowledge of the convexity
parameter of the objective function, while an upper bound for the distance between the initial
point and the optimal solution set is not necessary to ensure the optimal convergence rate with
respect to the iteration number.

We emphasize three particular contributions of this paper. 

First, the unifying framework yields generalizations of the MDM and the DAM originally
proposed for non-smooth problems, and of Nesterov’s and Tseng’s optimal PGMs originally proposed
for smooth (or composite) problems. As a consequence, the optimal convergence results for the DAM
and Tseng’s PGMs in the strongly convex cases are new, since the existing results were obtained
only for the non-strongly convex cases (Sections 5.1 and 5.3). Our unifying framework also includes
the classical gradient methods mi [38] which were previously analyzed in the strongly convex case.
However, our analysis provides slightly improved convergence estimates for them (Section 5.2).

Secondly, a new family of CGMs can be obtained from the unifying framework, which includes
Lan’s CGMs m, and yields an optimal convergence result for smooth problems in the
non-strongly convex case (Section 5.3); we further prove nearly optimal convergence rates of the
proposed CGMs for the classes of weakly smooth problems (Section 5.4.4). The advantage of our
unifying framework is a universal analysis of the PGMs and the CGMs.

Finally, we prove that our PGMs (including generalizations of Nesterov’s and Tseng’s PGMs)
attain the optimal convergence rate for weakly smooth and strongly convex problems (and for
further extended problems of the deterministic case of [18], Section 5.4.3). We remark that the
original Nesterov’s and Tseng’s PGMs were analyzed for smooth (or composite) problems only. In
contrast to the existing optimal method m, our PGMs ensure the optimality with less a priori
information on the objective function.

The current work can be seen as an extension of the recent work [22]. The above mentioned
three new contributions are particular consequences of the extension. In particular, the previous
work [22] cannot handle the CGMs and the strongly convex cases. Moreover, we extended the
structured problems of m so that we can now handle weakly smooth problems efficiently.

Another extension from [22] is that our framework (Property B) handles two kinds of auxiliary
subproblems at each iteration, which allows us to obtain new variations of subgradient methods
including Nesterov’s method in [35].

This paper is organized as follows. We first discuss some general considerations about strongly
convex problems in Section 2. In particular, in Section 2.1, we introduce a kind of “strong convexity”
with respect to the prox-function and define the classes of non-smooth and of structured
problems considered in this paper. We list some existing methods in the remaining part. We propose
the unified framework of subgradient-based methods and general guidelines for constructing
subproblems in Section 3. We analyze the proposed general (sub)gradient methods and establish
general convergence results in Section 4. Finally, in Section 5, we discuss the rates of convergence
for the non-smooth and the structured problems, providing the (nearly) optimal complexity for
them.

2 Problem settings and existing methods 

2.1 Convex optimization problem and assumptions

Let us consider the following convex optimization problem: 

$$\min_{x \in Q} f(x) \tag{1}$$

where $Q$ is a closed convex subset of a finite dimensional real normed space $E$ equipped with a
norm $\|\cdot\|$, and $f : E \to \mathbb{R} \cup \{+\infty\}$ is a lower-semicontinuous (lsc) convex function with $Q \subset \operatorname{dom} f$.
We denote by $E^*$ the dual space of $E$ equipped with the dual norm $\|s\|_* = \max_{\|x\| \le 1} \langle s, x\rangle$ for
$s \in E^*$, where $\langle s, x\rangle$ is the value of $s \in E^*$ at $x \in E$. We always assume that the problem (1) has
an optimal solution $x^* \in Q$.

Throughout this paper, we mainly focus on two particular classes of convex optimization problems:
the class of non-smooth problems and the class of structured problems, which will be
defined shortly.

We introduce a prox-function $d(x)$ on the feasible set $Q$, that is, $d : E \to \mathbb{R} \cup \{+\infty\}$ is a
nonnegative, continuously differentiable, and strongly convex function on $Q$ (therefore, $Q \subset \operatorname{dom} d$)
with a constant $\sigma_d > 0$ such that $d(x_0) = \min_{x \in Q} d(x) = 0$ for the unique minimizer $x_0 \in Q$. We
use the notation $l_d(y; x) := d(y) + \langle \nabla d(y), x - y\rangle$ for the linearization of $d(x)$ at $y \in Q$. We also
define the Bregman distance [7] between $x$ and $y$ for $x, y \in Q$ by

$$\xi(y, x) := d(x) - d(y) - \langle \nabla d(y), x - y\rangle = d(x) - l_d(y; x).$$

Note that the strong convexity of $d(x)$ on $Q$ is equivalent to the property
$\xi(y, x) \ge \frac{\sigma_d}{2}\|x - y\|^2$, $\forall x, y \in Q$. The prox-function as well as the Bregman distance will be used for the construction
of auxiliary functions in the subproblems solved at each iteration in the methods described in this
paper. We also assume that the prox-function $d(x)$ is fixed throughout the paper. A simple
example for $d(x)$ is the Euclidean setting, namely, $E$ is a Euclidean space with $\|x\|_2 = \langle x, x\rangle^{1/2}$,
and $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ for some $x_0 \in Q$.
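To make the definitions concrete, the following minimal Python sketch evaluates the Bregman distance for two prox-functions: the Euclidean one from the text and, as an additional illustration not taken from the paper, the shifted entropy on the unit simplex (strongly convex with respect to $\|\cdot\|_1$ with constant 1 by Pinsker's inequality). The helper names are hypothetical.

```python
import numpy as np

def bregman(d, grad_d, y, x):
    """Bregman distance xi(y, x) = d(x) - d(y) - <grad d(y), x - y>."""
    return d(x) - d(y) - np.dot(grad_d(y), x - y)

# Euclidean prox-function d(x) = 0.5 * ||x - x0||_2^2 (sigma_d = 1).
x0 = np.zeros(3)
d_euc = lambda x: 0.5 * np.dot(x - x0, x - x0)
grad_euc = lambda x: x - x0

# Assumed second example: entropy on the simplex, shifted so that d = 0
# at the uniform point (its minimizer over the simplex).
def d_ent(x):
    return float(np.sum(x * np.log(x)) + np.log(x.size))
grad_ent = lambda x: np.log(x) + 1.0

rng = np.random.default_rng(0)
y, x = rng.normal(size=3), rng.normal(size=3)
xi_euc = bregman(d_euc, grad_euc, y, x)      # equals 0.5 * ||x - y||_2^2

p, q = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
xi_ent = bregman(d_ent, grad_ent, p, q)      # equals KL(q || p) on the simplex
```

In both cases the lower bound $\xi(y,x) \ge \frac{\sigma_d}{2}\|x-y\|^2$ can be checked numerically with the norm that matches the prox-function.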

For a lsc convex function $\psi : E \to \mathbb{R} \cup \{+\infty\}$ with $Q \subset \operatorname{dom}\psi$, we introduce the set

$$\sigma(\psi) := \{\tau \ge 0 : \psi(x) - \tau d(x) \text{ is a lsc convex function on } Q\}.$$

The set $\sigma(\psi)$ corresponds to the set of “convexity parameters” of $\psi(x)$ on $Q$ with respect to the
prox-function $d(x)$. In the Euclidean setting $d(x) = \frac{1}{2}\|x - x_0\|_2^2$, the set $\sigma(\psi)$ is the set of convexity
parameters of $\psi(x)$ in the usual sense. Furthermore, in general, it can be shown that $\tau \in \sigma(\psi)$ if
and only if the following inequality holds:

$$\psi(x) \ge \psi(y) + \psi'(y; x - y) + \tau\xi(y, x), \quad \forall x, y \in Q\ (\subset \operatorname{dom}\psi), \tag{2}$$

where $\psi'(x; d) = \lim_{\alpha \downarrow 0} \frac{\psi(x + \alpha d) - \psi(x)}{\alpha}$ ($x \in \operatorname{dom}\psi$, $d \in E$).¹ This form is similar to the
characterization of the usual strong convexity of $\psi(x)$ on $Q$ with constant $\tau > 0$:
$\psi(x) \ge \psi(y) + \psi'(y; x - y) + \frac{\tau}{2}\|x - y\|^2$, $\forall x, y \in Q$. Therefore, $\tau \in \sigma(\psi)$ implies the usual strong convexity
of $\psi(x)$ on $Q$ with constant $\tau\sigma_d$, since $\xi(y, x) \ge \frac{\sigma_d}{2}\|x - y\|^2$, $\forall x, y \in Q$. On the other hand,
if the Bregman distance $\xi(y, x)$ grows quadratically on $Q$ with a constant $A > 0$ (see m),
$\xi(y, x) \le \frac{A}{2}\|x - y\|^2$, $\forall x, y \in Q$, then the usual strong convexity of $\psi(x)$ on $Q$ with a constant
$\tau > 0$ implies $\tau/A \in \sigma(\psi)$.

We assume a “strong convexity” of the objective function $f(x)$ by supposing that $\sigma(f) \setminus \{0\} \ne \emptyset$.
However, in order to deal with several structured optimization problems, as we will see in Section
2.3, we need to assume stronger conditions on the objective function as follows. Let us assume
that, for each $y \in Q$, there exists a lsc convex function $m_f(y; \cdot) : E \to \mathbb{R} \cup \{+\infty\}$ such that
$m_f(y; x) \le f(x)$ for all $x \in Q$; we call the function $m_f(y; x)$ a lower approximation model of $f(x)$.
We further assume that there exists a convexity parameter $\sigma_f \ge 0$ such that

$$\sigma_f \in \sigma(f) \cap \bigcap_{y \in Q} \sigma(m_f(y; \cdot)). \tag{3}$$

¹Notice that the function $\varphi(x) := \psi(x) - \tau d(x)$ satisfies $\varphi'(y; x - y) = \psi'(y; x - y) - \tau\langle\nabla d(y), x - y\rangle$, $\forall x, y \in Q$.
Hence, the convexity of $\varphi(x)$ on $Q$ implies $\varphi(x) \ge \varphi(y) + \varphi'(y; x - y)$, $\forall x, y \in Q$, which is equivalent to (2).
Conversely, since $\psi'(y; x - y) \ge -\psi'(y; y - x)$ holds, and the same is true for $\varphi(\cdot)$, for $x, y \in Q$, (2) implies the two inequalities
$\varphi(y) \ge \varphi(z) + \varphi'(z; y - z)$ and $\varphi(x) \ge \varphi(z) - \varphi'(z; z - x)$ for $x, y, z \in Q$. Since $\varphi'(y; \cdot)$ is positively homogeneous, the
convexity of $\varphi(\cdot)$ on $Q$ follows by taking a convex combination of the two with $z = \alpha x + (1 - \alpha)y$, $\alpha \in [0, 1]$, $x, y \in Q$.
Note that, since $f'(x^*; x - x^*) \ge 0$ holds for all $x \in Q$ by the optimality of $x^*$, the condition
$\sigma_f \in \sigma(f)$ implies that $f(x) - f(x^*) \ge \sigma_f\xi(x^*, x)$ for all $x \in Q$.

The function $m_f(y; x)$ can be seen as a strongly convex lower approximation of $f(x)$ at $y \in Q$,
and its construction depends on the problem structure. Notice also that the condition (3) is not
as restrictive as it may appear, especially if the problem (1) is endowed with some structure.

The convex optimization problem (1) which we consider in this paper will be particularized
into the following two classes for convenience.

Definition 2.1. The class of non-smooth problems consists of convex optimization problems (1)
where we assume for each problem that we know a subgradient mapping $g(x) \in \partial f(x)$, $x \in Q$, and
a convexity parameter $\sigma_f \in \sigma(f)$. Then, we can naturally define its lower approximation model
by

$$m_f(y; x) := f(y) + \langle g(y), x - y\rangle + \sigma_f\xi(y, x). \tag{4}$$

Therefore, it satisfies (3). Moreover, we assume that for every $s \in E^*$ and $\beta > 0$, the following
optimization problem is solvable:

$$\min_{x \in Q}\{\langle s, x\rangle + \beta d(x)\}. \tag{5}$$

This class of problems is denoted by MSV$(g, \sigma_f)$.

Notice that non-smooth problems satisfy the requirement (3) because $m_f(y; x) - \sigma_f d(x)$ becomes
an affine function. For convenience, we denote $g_k := g(x_k) \in \partial f(x_k)$ for test points $x_k$.
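As an illustration of Definition 2.1, the sketch below builds the model (4) for a hypothetical objective $f(x) = \|x\|_1 + \frac{\mu}{2}\|x\|_2^2$ in the Euclidean setting with $x_0 = 0$, so that $\sigma_f = \mu$ is a valid convexity parameter, and samples the gap $f(x) - m_f(y; x)$, which should be nonnegative. The objective and all names are assumptions chosen for the example.

```python
import numpy as np

mu = 0.5                      # assumed convexity parameter sigma_f

def f(x):
    return np.sum(np.abs(x)) + 0.5 * mu * np.dot(x, x)

def g(x):
    # One subgradient of f at x (np.sign returns 0 at the kinks).
    return np.sign(x) + mu * x

def xi(y, x):
    # Bregman distance of d(x) = 0.5 * ||x||_2^2.
    return 0.5 * np.dot(x - y, x - y)

def m_f(y, x):
    # Lower approximation model (4).
    return f(y) + np.dot(g(y), x - y) + mu * xi(y, x)

rng = np.random.default_rng(1)
gaps = [f(x) - m_f(y, x)
        for y, x in (rng.normal(size=(2, 4)) for _ in range(1000))]
min_gap = min(gaps)           # expected to be nonnegative up to rounding
```

For this choice the gap reduces to $\|x\|_1 - \langle \operatorname{sign}(y), x\rangle$, since the quadratic terms cancel exactly.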

Definition 2.2. The class of structured problems consists of convex optimization problems (1)
where we assume for each problem that there exists $(m_f(\cdot; \cdot), \sigma_f, \bar{\sigma}_f, L(\cdot), \delta(\cdot, \cdot))$, i.e., functions and
constants, satisfying the inequality

$$f(x) \le [m_f(y; x) - \bar{\sigma}_f\xi(y, x)] + \frac{L(y)}{2}\|y - x\|^2 + \delta(y, x), \quad \forall x, y \in Q, \tag{6}$$

where $m_f(y; \cdot)$ is a lower approximation model of $f(\cdot)$ which admits (3) for $\sigma_f \ge 0$, $\delta(y, \cdot)$ is a
nonnegative convex function on $Q$ for $y \in Q$, $L(\cdot) \ge 0$, and $\bar{\sigma}_f \in [0, \sigma_f]$. We further assume that
for every $\beta \ge 0$, $y \in E$ and $s \in E^*$, the optimization problems of the following form are efficiently
solvable:

$$\min_{x \in Q}\{m_f(y; x) + \langle s, x\rangle + \beta d(x)\}. \tag{7}$$

This class of problems is denoted by SV$(m_f, \sigma_f, \bar{\sigma}_f, L, \delta)$.

Examples of such structured problems will be presented in Section 2.3.1.

The optimization problem (7) in the class of structured problems may differ from (5) in the
class of non-smooth ones depending on how we choose the functions $m_f(\cdot; \cdot)$ [e.g., see the example
(ii) in Section 2.3.1].

Note that when $\beta = 0$ and $\sigma_f = 0$, problem (7) may be a minimization of a convex function
which is non-strongly convex, in particular, an affine function on $Q$. In this case, we additionally
assume the boundedness of $Q$ to ensure the existence of its solution. This is the case for the
conditional gradient methods.

After developing a general analysis in Section 4, the function $\delta(y, x)$ will be finally particularized
for the constant case $\delta(y, x) \equiv \delta$ in Sections 5.2 and 5.3, and for the case $\delta(y, x) := \frac{M}{\rho}\|x - y\|^{\rho}$, $M \ge 0$,
$\rho \in [1, 2)$, in Section 5.4 (see Section 2.3 for several examples and related works). Note that,
when $\delta(y, x) \equiv \delta$ and $\sigma_f = 0$, the structured problem is equivalent to the one introduced in [22,
Section 5].
2.2 Existing methods for non-smooth problems

Consider the non-smooth problems in the class MSV$(g, \sigma_f)$. We assume for the moment that the
subgradient mapping $g(x) \in \partial f(x)$ of $f(x)$ is bounded, i.e., there exists $M > 0$ such that

$$\|g(x)\|_* \le M, \quad \forall x \in Q. \tag{8}$$

Let us first consider the case $\sigma_f = 0$. The original MDM and DAM, which solve this class of
problems, are known to be optimal PGMs. Considering the notation in [22, Method 9(a)], they
are particular cases of the following procedure:

$$x_0 := z_{-1} := \operatorname*{argmin}_{x \in Q} d(x), \qquad x_{k+1} := z_k, \quad k \ge 0, \tag{9}$$

where $z_k$ is the solution of the following fixed subproblem either from the extended Mirror-Descent
(MD) model

$$\min_{x \in Q}\{\lambda_k m_f(x_k; x) + \beta_k d(x) - \beta_{k-1} l_d(z_{k-1}; x)\}, \tag{10}$$

or from the Dual-Averaging (DA) model

$$\min_{x \in Q}\left\{\sum_{i=0}^{k} \lambda_i m_f(x_i; x) + \beta_k d(x)\right\}, \tag{11}$$

where $\{\lambda_k\}_{k \ge 0}$ and $\{\beta_k\}_{k \ge -1}$ are positive parameters called weight (or step-size) and scaling parameters,
respectively; recall that $m_f(y; x) = f(y) + \langle g(y), x - y\rangle$ by the definition (4) if $\sigma_f = 0$.

The MDM, originally proposed by Nemirovski and Yudin [32] and related to proximal subgradient
methods by Beck and Teboulle [1], corresponds to the method (9) with the update (10)
letting $\beta_k \equiv 1$. On the other hand, the method (9) with the update (11) yields the original
DAM proposed by Nesterov [37]. Tuning the scaling parameter $\{\beta_k\}$ enables us to obtain an
efficient convergence rate (see ESIET]); for instance, taking $\lambda_k = 1$ and $\beta_k = O(\sqrt{k})$ yields
$f(\hat{x}_k) - f(x^*) \le O(1/\sqrt{k})$ where $\hat{x}_k := \sum_{i=0}^{k} \lambda_i x_i / \sum_{i=0}^{k} \lambda_i$. In this case, one needs the values $d(x^*)$
and $M$ to define $\lambda_k$ and/or $\beta_k$ to achieve the optimal iteration complexity $O(M^2 d(x^*)/(\sigma_d\varepsilon^2))$ for
an absolute accuracy $\varepsilon > 0$.
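The procedure (9) with the two models can be sketched in a few lines under simplifying assumptions that are not part of the source: the Euclidean prox-function $d(x) = \frac{1}{2}\|x - x_0\|_2^2$, a box $Q = [-1, 1]^3$ (so both subproblems reduce to a projection), $\sigma_f = 0$, and a toy objective $f(x) = \|x - a\|_1$. The parameter choices ($\lambda_k = 1$, $\beta_k = \sqrt{k+1}$ for the DA model; $\beta_k \equiv 1$, $\lambda_k = c/\sqrt{k+1}$ for the MD model) are illustrative and not tuned as in the optimal analysis.

```python
import numpy as np

a = np.array([0.5, -0.3, 0.2])
f = lambda x: np.sum(np.abs(x - a))          # non-smooth objective, f* = 0
g = lambda x: np.sign(x - a)                 # a subgradient of f
proj = lambda x: np.clip(x, -1.0, 1.0)       # projection onto Q = [-1, 1]^3
x0 = np.zeros(3)                             # minimizer of d over Q

def dual_averaging(K):
    """Update (9) with the DA model (11), lambda_k = 1, beta_k = sqrt(k + 1)."""
    x, s, S, x_sum = x0.copy(), np.zeros(3), 0.0, np.zeros(3)
    for k in range(K + 1):
        s += g(x)                            # sum_i lambda_i g_i
        x_sum += x                           # weighted sum of test points
        S += 1.0
        x = proj(x0 - s / np.sqrt(k + 1.0))  # z_k, then x_{k+1} := z_k
    return x_sum / S                         # averaged point \hat{x}_K

def mirror_descent(K, c=0.5):
    """Update (9) with the MD model (10), beta_k = 1, lambda_k = c / sqrt(k + 1)."""
    x, S, x_sum = x0.copy(), 0.0, np.zeros(3)
    for k in range(K + 1):
        lam = c / np.sqrt(k + 1.0)
        x_sum += lam * x
        S += lam
        x = proj(x - lam * g(x))             # z_k = Proj_Q(z_{k-1} - lambda_k g_k)
    return x_sum / S

gap_da = f(dual_averaging(2000))
gap_md = f(mirror_descent(2000))
```

With $\beta_k \equiv 1$ the MD subproblem collapses to the projected subgradient step, because $d(x) - l_d(z_{k-1}; x) = \xi(z_{k-1}, x) = \frac{1}{2}\|x - z_{k-1}\|_2^2$ in the Euclidean setting.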

When $\sigma_f > 0$ is known, the extended MDM also admits the optimal complexity $O(M^2/(\sigma_d\sigma_f\varepsilon))$
for the strongly convex case under a suitable choice of $\lambda_k$ and $\beta_k$ ([30, Theorem 1]; see also [H |29]
for related results). Moreover, it is proved that a multistage procedure for the DAM achieves
the optimal complexity for problems of minimizing uniformly convex functions, a generalization of
strongly convex ones, with further consideration in a stochastic setting [25].

As we mention next, an extended class of problems including non-smooth and smooth ones is
considered in [IHl (H ED |39], which propose optimal PGMs for these problems and therefore for
the non-smooth problems as well.

2.3 Examples and existing methods for structured problems 
2.3.1 Examples of structured problems 

The class SV$(m_f, \sigma_f, \bar{\sigma}_f, L, \delta)$ of structured problems introduced in Section 2.1 includes several
special convex problems that are also possibly non-smooth. We list some existing examples and
results which can be discussed in this setting considering the requirements (3) and (6).

(i) Smooth problems. Suppose that $f(x)$ belongs to $C^{1,1}_L(Q)$; that is, $f(x)$ is continuously differentiable
on $Q$ and $\nabla f(x)$ satisfies the Lipschitz condition on $Q$ with constant $L > 0$:
$\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$, $\forall x, y \in Q$. When we know a constant $\sigma_f \in \sigma(f)$, we can
define

$$m_f(y; x) := f(y) + \langle\nabla f(y), x - y\rangle + \sigma_f\xi(y, x)$$

to obtain (3) and (6) with $L(\cdot) := L$, $\bar{\sigma}_f := \sigma_f$, and $\delta(\cdot, \cdot) := 0$. The corresponding subproblem
(7) reduces to the form (5).

The smooth problem with the Euclidean setting $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ is the most basic one
among the examples here; in this case, the lower complexity bounds $O(\sqrt{L d(x^*)/\varepsilon})$ for the
case $\sigma_f = 0$ and $O(\sqrt{L/\sigma_f}\,\log(1/\varepsilon))$ for the case $\sigma_f > 0$ are known for an absolute accuracy
$\varepsilon > 0$. The first optimal PGM for the Euclidean case was proposed by Nesterov [33] and its
variants were developed in [M], and in |2l|35| for non-strongly convex cases.

CGMs are also considered for the smooth problems, which achieve the complexity $O(LR^2/\varepsilon)$
where $R := \operatorname{Diam}(Q) = \sup_{x, y \in Q}\|x - y\|$ Uni (131 [13 ESI EH mi; excepting Lan’s modified
CGMs [27], all of these CGMs are based on the classical CGM [15], as we show in the
algorithm (15).

(ii) Composite problems. Consider an objective function $f(x)$ of the form $f(x) = f_0(x) + \Psi(x)$
where $f_0 \in C^{1,1}_L(Q)$ and $\Psi(x)$ is a lsc convex function on $Q$ with a simple structure. If we
know constants $\sigma_{f_0} \in \sigma(f_0)$ and $\sigma_\Psi \in \sigma(\Psi)$, then we can take

$$m_f(y; x) := f_0(y) + \langle\nabla f_0(y), x - y\rangle + \sigma_{f_0}\xi(y, x) + \Psi(x)$$

from which (3) and (6) hold with $\sigma_f := \sigma_{f_0} + \sigma_\Psi$, $L(\cdot) := L$, $\bar{\sigma}_f := \sigma_{f_0}$, and $\delta(\cdot, \cdot) := 0$.
There are many PGMs for this problem EZlEllsHlSaiSSI and they provide the same iteration
complexity as the lowest complexity for the smooth problems in the non-strongly convex case
(excepting the work by Fukushima and Mine [17], because they studied this model without
assuming the convexity of $f_0(x)$). Nesterov [38] further proposed an optimal method for
strongly convex composite problems in the Euclidean setting. The smoothing technique
proposed by Nesterov [35] and its extension [6] for a special form of $\Psi(x)$ are also important
because of their significant advantage in efficiency; they are further considered in the
strongly convex case [36].

A generalization of the CGM to the composite problems was investigated in m, which also deals
with a duality relationship to the MDM.

(iii) Inexact oracle model. Suppose that $f(x)$ is equipped with a first-order $(\delta, L, \mu)$-oracle [11],
i.e., for each $y \in Q$, we can compute $(f_{\delta,L,\mu}(y), g_{\delta,L,\mu}(y)) \in \mathbb{R} \times E^*$ such that

$$\frac{\mu}{2}\|x - y\|^2 \le f(x) - \bigl(f_{\delta,L,\mu}(y) + \langle g_{\delta,L,\mu}(y), x - y\rangle\bigr) \le \frac{L}{2}\|x - y\|^2 + \delta, \quad \forall x \in Q,$$

where $\delta \ge 0$ and $L \ge \mu \ge 0$. If $\mu = 0$ or the prox-function grows quadratically on $Q$ with
constant $A > 0$, then defining

$$m_f(y; x) := f_{\delta,L,\mu}(y) + \langle g_{\delta,L,\mu}(y), x - y\rangle + \frac{\mu}{A}\xi(y, x)$$

admits (3) and (6) with $L(\cdot) := L$, $\sigma_f := \bar{\sigma}_f := \mu/A$, and $\delta(\cdot, \cdot) := \delta$. The inexact oracle
model with $\mu = 0$ was first studied by Devolder et al. [12], who proposed the classical
and the fast (proximal) gradient methods, which were extended to the strongly convex case
in m. A CGM for this model in the case $\mu = 0$ was analyzed by ESj.
(iv) Weakly smooth problems. Suppose that the objective function $f(x)$ belongs to $C^{1,\nu}_M(Q)$ for
some $\nu \in [0, 1)$, i.e., $f(x)$ is continuously differentiable on $Q$ and $\nabla f(x)$ satisfies the Hölder
condition $\|\nabla f(x) - \nabla f(y)\|_* \le M\|x - y\|^{\nu}$, $\forall x, y \in Q$; but in the case $\nu = 0$, we do not
assume the smoothness of $f(x)$ and we understand $\nabla f(x)$ as an element of $\partial f(x)$. Since the
Hölder condition implies the inequality

$$f(x) - f(y) - \langle\nabla f(y), x - y\rangle \le \frac{M}{1 + \nu}\|x - y\|^{1+\nu}, \quad \forall x, y \in Q, \tag{12}$$

defining $m_f(y; x)$ as in (i) for $\sigma_f \in \sigma(f)$, it admits (3) and (6) with $L(\cdot) := 0$, $\bar{\sigma}_f := \sigma_f$,
and $\delta(y, x) := \frac{M}{1 + \nu}\|x - y\|^{1+\nu}$. The weakly smooth version of the composite and the saddle
structures can also be considered in the same way.

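For the weakly smooth problems, Nemirovski and Nesterov [31] (see also [14, Section 2.3])
proposed an optimal PGM with the (optimal) complexity bounds

$$c_1(\rho)\,\bigl(\cdots\bigr)^{\frac{\cdots}{3\rho-2}} \quad\text{and}\quad c_2(\rho)\,\bigl(M^{\cdots}\cdots\bigr)^{\frac{\cdots}{3\rho-2}} \tag{13}$$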


for non-strongly and strongly convex cases, respectively, where $\rho := 1 + \nu \in [1, 2)$, $c_1(\cdot)$, $c_2(\cdot)$
are continuous functions, and $\sigma > 0$ is a convexity parameter of $f$ with respect to the
norm $\|\cdot\|$; the proposed method is further applicable to more general classes of problems.
Moreover, Nesterov m relaxed a restriction of the method in the non-strongly convex case
in the sense that the proposed method ensures the optimal convergence rate without fixing
the iteration number. It is important to note that the methods proposed by m and [39]
can achieve the above iteration complexity for the non-strongly convex case even if we do not
know $M$ and $\nu$, while the proposed method here needs an additional (but relatively small)
“cost” for estimating $M$. This approach can also be seen in [31331138] for an estimation
of the Lipschitz constant $M$ in the case $\nu = 1$. The studies mm of the inexact oracle
model are also important; they proposed an optimal method for weakly smooth problems in
the non-strongly convex case and a sub-optimal one in the strongly convex case (PGMs for
uniformly convex functions are also discussed).

A convergence result for CGMs for this class can also be obtained in the same way as for the
smooth problems, which ensures the complexity $O((MR^{1+\nu}/\varepsilon)^{1/\nu})$ where $R := \operatorname{Diam}(Q)$ (see [D,
Proposition 1.1] and [30]).

(v) The objective functions in (i) and (iv) can be simultaneously considered by assuming

$$f(y) - f(x) - \langle g(x), y - x\rangle \le \frac{L}{2}\|y - x\|^2 + \frac{M}{\rho}\|y - x\|^{\rho}, \quad \forall x, y \in Q,$$

for a subgradient mapping $g(x) \in \partial f(x)$, $L, M \ge 0$, and $\rho \in [1, 2)$. When $\sigma_f \in \sigma(f)$, we can
take $m_f(y; x) := f(y) + \langle g(y), x - y\rangle + \sigma_f\xi(y, x)$ to obtain (3) and (6) with $L(\cdot) := L$, $\bar{\sigma}_f := \sigma_f$,
and $\delta(y, x) := \frac{M}{\rho}\|y - x\|^{\rho}$. When $\sigma_f = 0$ or the prox-function grows quadratically on $Q$,
(nearly) optimal PGMs for this model in the case $\rho = 1$ are studied in [3 ISl UHl [23128] with
a stochastic setting.
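The weakly smooth bound (12) from example (iv) can be checked numerically on a one-dimensional function. The sketch below assumes the hypothetical choice $f(t) = |t|^{1+\nu}/(1+\nu)$, whose derivative $\operatorname{sign}(t)|t|^{\nu}$ is Hölder continuous with constant $M = 2^{1-\nu}$; neither the function nor the constant comes from the source.

```python
import numpy as np

nu = 0.5
M = 2.0 ** (1.0 - nu)         # Hölder constant of f' for this particular f

f = lambda t: np.abs(t) ** (1.0 + nu) / (1.0 + nu)
df = lambda t: np.sign(t) * np.abs(t) ** nu

rng = np.random.default_rng(2)
x, y = rng.uniform(-3, 3, size=(2, 10000))

# Hölder condition on the derivative and the resulting upper bound (12).
holder_ok = np.all(np.abs(df(x) - df(y)) <= M * np.abs(x - y) ** nu + 1e-12)
lhs = f(x) - f(y) - df(y) * (x - y)
rhs = M / (1.0 + nu) * np.abs(x - y) ** (1.0 + nu)
bound_ok = np.all(lhs <= rhs + 1e-12)
```

The right-hand side `rhs` is exactly the function $\delta(y, x)$ used for this class with $L(\cdot) := 0$.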


2.3.2 Existing methods for structured problems 

We finally describe some particular PGMs and CGMs which will be important for the comparison
with the proposed methods in the paper. For that, we introduce two kinds of update formulas of
gradient-based methods.





The first is the Classical Gradient Method [22, Method 16], which performs as follows: For
given weight parameters $\{\lambda_k\}_{k \ge 0}$ and scaling parameters $\{\beta_k\}_{k \ge -1}$, generate $\{z_k\}_{k \ge -1}$ and $\{x_k\}_{k \ge 0}$ by the
update (9) with the model (10) or (11), and set $\{\hat{x}_k\}_{k \ge 0}$ as the corresponding $\lambda_i$-weighted averages
(normalized by $\sum_{i=0}^{k}\lambda_i$). The primal
and dual gradient methods in [38] for the composite problems (ii) and in [12] for the inexact oracle
model (iii) are closely related to this algorithm in the non-strongly convex case. A further relation
in the strongly convex case will be presented in this paper.

The second, the Fast Gradient Method (FGM) [22, Method 17], is described as follows: For
given weight parameters $\{\lambda_k\}_{k \ge 0}$ and scaling parameters $\{\beta_k\}_{k \ge -1}$, set $x_0 := z_{-1} := \operatorname{argmin}_{x \in Q} d(x)$, $\hat{x}_0 := z_0$
and, for $k \ge 0$, iterate

$$\begin{aligned} x_{k+1} &:= (1 - \tau_k)\hat{x}_k + \tau_k z_k, \quad \text{where } \tau_k := \frac{\lambda_{k+1}}{\sum_{i=0}^{k+1}\lambda_i}, \\ \hat{x}_{k+1} &:= (1 - \tau_k)\hat{x}_k + \tau_k z_{k+1}, \end{aligned} \tag{14}$$

where $z_k$ is determined by the fixed subproblem, either the extended MD model (10) or the DA
model (11). It was indicated in [22] that the FGM with $\lambda_0 := 1$, $\lambda_{k+1} := \cdots$ ($k \ge 0$),
and $\beta_k \equiv L/\sigma_d$ yields Tseng’s accelerated PGMs [l3] for the composite problems, which
achieve the convergence rate $f(\hat{x}_k) - f(x^*) \le O(L d(x^*)/(\sigma_d k^2))$, yielding the optimal complexity
$O(\sqrt{L d(x^*)/(\sigma_d\varepsilon)})$ as in (i) in the non-strongly convex case.
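The iteration (14) with the DA model (11) can be sketched as follows, under assumptions that are ours rather than the source's: $Q = \mathbb{R}^n$, the Euclidean prox-function (so $\sigma_d = 1$ and the DA subproblem has a closed form), $\sigma_f = 0$, the weights $\lambda_k := (k+1)/2$, $\beta_k \equiv L$, and a toy quadratic objective. For these choices a standard estimate-sequence argument gives $f(\hat{x}_k) - f(x^*) \le 2L\|x_0 - x^*\|_2^2/((k+1)(k+2))$, which the sketch compares against.

```python
import numpy as np

A = np.diag([1.0, 10.0, 100.0])              # f(x) = 0.5 x^T A x - b^T x
b = np.array([1.0, 1.0, 1.0])
L = 100.0                                    # Lipschitz constant of grad f
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)

def fgm_da(x0, K):
    """FGM (14) with the DA model (11), Q = R^n, d(x) = 0.5 ||x - x0||^2,
    sigma_f = 0, lambda_k = (k + 1) / 2 and beta_k = L (assumed choices)."""
    lam = lambda k: 0.5 * (k + 1)
    s = lam(0) * grad(x0)                    # sum_i lambda_i grad f(x_i)
    S = lam(0)                               # S_k = sum_i lambda_i
    z = x0 - s / L                           # z_0 solves the DA subproblem
    x_hat = z.copy()                         # \hat{x}_0 := z_0
    for k in range(K):
        tau = lam(k + 1) / (S + lam(k + 1))  # tau_k = lambda_{k+1} / S_{k+1}
        x = (1 - tau) * x_hat + tau * z      # test point x_{k+1}
        s += lam(k + 1) * grad(x)
        S += lam(k + 1)
        z = x0 - s / L                       # z_{k+1}
        x_hat = (1 - tau) * x_hat + tau * z  # \hat{x}_{k+1}
    return x_hat

K = 200
x0 = np.zeros(3)
gap = f(fgm_da(x0, K)) - f(x_star)
bound = 2 * L * np.dot(x0 - x_star, x0 - x_star) / ((K + 1) * (K + 2))
```

The weights satisfy $\lambda_{k+1}^2 / S_{k+1} = (k+2)/(k+3) \le \beta_k\sigma_d/L$, which is the condition the argument behind the bound relies on.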

Furthermore, the algorithm (14) is also closely related to the following PGM and CGM, which
will be unified in the framework of this paper:

• Replacing the second update in (14) by $\hat{x}_{k+1} := (1 - \tau_k)\hat{x}_k + \tau_k w_{k+1}$, determining $w_k$ and $z_k$ by
(10) and (11) with $\beta_k := L/\sigma_d$, respectively, the corresponding method with $\lambda_k := (k + 1)/2$
yields Nesterov’s optimal PGM [35, Section 5.3] for the smooth problems in the non-strongly
convex case. We remark that the optimal complexity of the
FGM and of this PGM of Nesterov in the strongly convex case is not known to be attained without using
a multistage procedure; in the Euclidean setting, it turns out that a multistage procedure for
them attains the optimal complexity $O(\sqrt{L/\sigma_f}\,\log(1/\varepsilon))$ in the strongly convex case (see,
e.g.,² [Ml Section 5.1]).

• Letting $\lambda_k := (k + 1)/2$ and assuming the boundedness of $Q$, the algorithm (14) with the
subproblems (10) and (11) with $\beta_k \equiv 0$ corresponds to Lan’s modified CGMs, Algorithms 4
and 5, respectively, in m with the stepsize policy $\alpha_k := 2/(k + 1)$ and $\theta_k := k$.

On the other hand, the classical CGM [iniiiaiii] for smooth problems is basically performed
as follows: Choose $x_0 \in Q$ and, for $k \ge 0$, iterate

$$z_k \in \operatorname*{Argmin}_{x \in Q}\langle\nabla f(x_k), x - x_k\rangle, \qquad x_{k+1} := (1 - \tau_k)x_k + \tau_k z_k, \quad k \ge 0, \tag{15}$$

where $\tau_k \in [0, 1]$ (we assume the boundedness of $Q$). Excepting Lan’s modified CGMs, all the
above mentioned CGMs are based on this classical CGM. Notice that the subproblem can be seen
as the extended MD model (10) with $\beta_k = 0$.
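A minimal sketch of the classical CGM (15) follows, assuming a toy instance that is not from the source: $Q$ is the unit simplex in $\mathbb{R}^3$, $f(x) = \frac{1}{2}\|x - c\|_2^2$ with $c \in Q$ (so $L = 1$ and $f(x^*) = 0$), and the commonly used step size $\tau_k = 2/(k+2)$. The linear subproblem over the simplex is solved by picking a vertex.

```python
import numpy as np

c = np.array([0.2, 0.3, 0.5])                # target inside the simplex
f = lambda x: 0.5 * np.dot(x - c, x - c)     # smooth, L = 1, f* = 0
grad = lambda x: x - c

def lmo_simplex(s):
    """argmin of <s, x> over the unit simplex: a vertex e_i, i = argmin s_i."""
    z = np.zeros_like(s)
    z[np.argmin(s)] = 1.0
    return z

def classical_cgm(x0, K):
    x = x0.copy()
    for k in range(K):
        z = lmo_simplex(grad(x))             # linear subproblem in (15)
        tau = 2.0 / (k + 2.0)                # assumed step-size policy
        x = (1 - tau) * x + tau * z
    return x

K = 500
x_K = classical_cgm(np.array([1.0, 0.0, 0.0]), K)
gap = f(x_K)
```

With $R = \sqrt{2}$ for the simplex in the Euclidean norm, the well-known estimate $f(x_K) - f(x^*) \le 2LR^2/(K+2)$ gives $4/(K+2)$ here, in line with the $O(LR^2/\varepsilon)$ complexity quoted for smooth problems. Every iterate is a convex combination of vertices, which is the sparsity property mentioned in the introduction.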


3 Unifying framework for (sub)gradient-based methods 

In this section we define the unifying framework, namely Methods 3.1 and 3.2 combined with
Properties A and B, which provides a generalization of some existing methods and new convergence
results with a universal analysis. The proposed methods require the computation of minimizer(s)
$z_k$ (and $w_k$) of one or two auxiliary problem(s) at each iteration, as do the existing methods presented
in Sections 2.2 and 2.3.2. In order to simplify the notation, we introduce auxiliary functions
$\varphi_k(x)$ and $\psi_k(x)$, and denote the minimizers of our subproblems as $z_k := \operatorname{argmin}_{x \in Q}\varphi_k(x)$ and
$w_k := \operatorname{argmin}_{x \in Q}\psi_k(x)$.

²In fact, since they have the convergence rate $f(\hat{x}_k) - f(x^*) \le cL\|x_0 - x^*\|_2^2/(2k^2)$ for some constant $c > 0$, after $k \ge \sqrt{2cL/\sigma_f}$
iterations, we have $f(\hat{x}_k) - f(x^*) \le \frac{\sigma_f}{4}\|x_0 - x^*\|_2^2 \le \frac{1}{2}(f(x_0) - f(x^*))$ by the strong convexity of $f$ and the optimality
of $x^*$. Then, restarting the method every $\sqrt{2cL/\sigma_f}$ iterations and repeating this $O(\log_2(1/\varepsilon))$ times ensures an $\varepsilon$-solution.

Now let us see how we proceed in specifying our (sub)gradient-based methods. They are
determined by the parameters $\{\lambda_k\}_{k \ge 0}$, $\{\beta_k\}_{k \ge -1}$, and functions $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$, where

• $\{\lambda_k\}_{k \ge 0}$ is a sequence of positive real numbers, the weight parameters,

• $\{\beta_k\}_{k \ge -1}$ is a nondecreasing sequence of nonnegative real numbers, the scaling parameters,
and

• $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ is a coupled sequence of auxiliary functions which are minimized at each
iteration.

We always assume that weight parameters are positive and that scaling parameters are nonnegative
and nondecreasing. Note that these objects are possibly determined in a recursive manner during
the methods. Then our methods generate the following sequences in $Q$.

• $\{x_k\}_{k \ge 0}$ is the sequence of test points for which we evaluate $m_f(x_k; x)$.

• $\{z_k\}_{k \ge -1}$ is the sequence of solutions of subproblems $\min_{x \in Q}\varphi_k(x)$.

• $\{w_k\}_{k \ge -1}$ is the sequence of solutions of subproblems $\min_{x \in Q}\psi_k(x)$.

• $\{\hat{x}_k\}_{k \ge 0}$ is the sequence of approximate solutions for the problem (1).

In view of our actual construction defined in Section 3.3, we suppose that the auxiliary functions
$\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ are constructed in association with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters
$\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$ in a recursive manner. We often consider the case of a single
sequence of auxiliary functions, which can be regarded as the case $\psi_k(x) = \varphi_k(x)$.

We will gradually specify the above general objects by giving explicit update formulas in three
steps: The first is for the points $\{x_k\}_{k \ge 0}$ and $\{\hat{x}_k\}_{k \ge 0}$ by proposing general (sub)gradient-based
methods (Section 3.2), the second is for the auxiliary functions $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ used in the
general methods (Section 3.3), and the final is for the parameters $\{\lambda_k\}_{k \ge 0}$ and $\{\beta_k\}_{k \ge -1}$ to provide
efficient convergence (Section 5).

3.1 General properties for the construction of auxiliary functions in the unifying framework

We begin by describing general properties which the auxiliary functions $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ should satisfy. These properties will guide us in how to iteratively construct the auxiliary functions. The first set of properties is for a sequence of auxiliary functions $\{\varphi_k(x)\}_{k \ge -1}$. We define

$$S_k := \sum_{i=0}^{k} \lambda_i, \qquad S_{-1} := 0.$$

Property A (in the unifying framework). Let $\{\varphi_k(x)\}_{k \ge -1}$ be a sequence of auxiliary functions associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Let $\sigma_f \ge 0$ be a convexity parameter for some lower approximation model $m_f(y; x)$ of $f(x)$. Denote $z_k := \arg\min_{x \in Q} \varphi_k(x)$. Then, the following conditions hold:

(A1) $\varphi_{-1}(z_{-1}) = 0$ and $z_{-1} = x_0$.

(A2) $\forall k \ge -1$, $\forall x \in Q$, we have

$$\varphi_{k+1}(x) \ge \varphi_k(z_k) + \lambda_{k+1} m_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x) + S_k \sigma_f \xi(z_k, x).$$

(A3) $\forall k \ge -1$, $\varphi_k(z_k) \le \min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i m_f(x_i; x) + \beta_k l_d(z_k; x) - S_k \sigma_f \xi(z_k, x) \right\}$.

Footnote. The auxiliary function $\varphi_k(x)$ can possibly be an affine function. In that case, we will assume the boundedness of $Q$ in order to ensure the existence of a minimizer $z_k$.

The above property is a generalization of Property 2 of [22], which is recovered by taking $\sigma_f = 0$. As a simple extension of Property A, we further consider a coupled sequence $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ of auxiliary functions which admits the property below.

Property B (in the unifying framework). Let $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ be a coupled sequence of auxiliary functions associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Denote $z_k := \arg\min_{x \in Q} \varphi_k(x)$ and $w_k := \arg\min_{x \in Q} \psi_k(x)$. Let $\sigma_f \ge 0$ be a convexity parameter for some lower approximation model $m_f(y; x)$ of $f(x)$. Then, the following conditions hold:

(B0) $\varphi_k(x) \ge \psi_k(x)$ for all $x \in Q$.

(B1) $\varphi_{-1}(z_{-1}) = \psi_{-1}(w_{-1}) = 0$ and $z_{-1} = w_{-1} = x_0$.

(B2) $\forall k \ge -1$, $\forall x \in Q$, we have

$$\psi_{k+1}(x) \ge \varphi_k(z_k) + \lambda_{k+1} m_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x) + S_k \sigma_f \xi(z_k, x).$$

(B3) $\forall k \ge -1$, $\psi_k(w_k) \le \min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i m_f(x_i; x) + \beta_k l_d(z_k; x) - S_k \sigma_f \xi(z_k, x) \right\}$.

Note that letting $\psi_k(x) = \varphi_k(x)$ yields Property A.


3.2 (Sub)gradient-based methods in the unifying framework 

We propose the following (sub)gradient-based methods for non-smooth problems (Method 3.1) and structured problems (Method 3.2), respectively. Each of them has two types of updates, the classical and the modified ones.


Method 3.1 (Subgradient-based methods for non-smooth problems in the unifying framework). Consider a non-smooth problem in the class $\mathcal{NSP}(g, \sigma_f)$. Let $\{\lambda_k\}_{k \ge 0}$ and $\{\beta_k\}_{k \ge -1}$ be sequences of weight and scaling parameters, respectively. Generate a sequence $\{(z_{k-1}, x_k, g_k, \hat{x}_k)\}_{k \ge 0}$ by either the classical or the modified method as follows.

(0) Set $\hat{x}_0 := x_0 := z_{-1} := \arg\min_{x \in Q} d(x)$.

(1) ($k$-th iteration, $k \ge 0$) Set $g_k := g(x_k) \in \partial f(x_k)$ and compute $z_k, x_{k+1}, \hat{x}_{k+1}$ by

Classical method:

$$x_{k+1} := z_k := \arg\min_{x \in Q} \varphi_k(x), \qquad \hat{x}_{k+1} := \frac{S_k \hat{x}_k + \lambda_{k+1} x_{k+1}}{S_{k+1}},$$

or

Modified method:

$$z_k := \arg\min_{x \in Q} \varphi_k(x), \qquad x_{k+1} := \hat{x}_{k+1} := \frac{S_k \hat{x}_k + \lambda_{k+1} z_k}{S_{k+1}}.$$

Here, $\{\varphi_k(x)\}_{k \ge -1}$ is a single sequence of auxiliary functions satisfying Property A.

Note that we did not use a coupled sequence $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ of auxiliary functions because, as we will see, the analysis (Lemmas 4.6, 4.7, and 4.8) for the non-smooth problems is independent of the second object $\{\psi_k(x)\}_{k \ge -1}$ (or $w_k$).
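As an illustration of Method 3.1, the following is a minimal sketch rather than the paper's implementation. It assumes the Euclidean setup $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ (so $\xi(y,x) = \frac{1}{2}\|x-y\|_2^2$ and $\sigma_d = 1$), the DA-type auxiliary function $\varphi_k(x) = \sum_{i \le k} \lambda_i m_f(x_i; x)$ with $\beta_k = 0$, the weights $\lambda_k = (k+1)/2$, and $\sigma_f > 0$. The function name `subgradient_method`, the test objective, and the box $Q = [-1,1]^3$ are hypothetical choices made only for this example.

```python
import numpy as np

def subgradient_method(subgrad, x0, sigma_f, project, num_iters, modified=False):
    """Sketch of Method 3.1 (DA-type model, beta_k = 0, lambda_k = (k+1)/2).

    Assumes d(x) = 0.5*||x - x0||^2, hence xi(y, x) = 0.5*||x - y||^2,
    sigma_d = 1, and sigma_f > 0. Returns the approximate solution \hat{x}_K.
    """
    x = np.asarray(x0, dtype=float)   # test point x_k, with x_0 = argmin d
    x_hat = x.copy()                  # approximate solution \hat{x}_k
    S = 0.0                           # S_{k-1}, starting from S_{-1} = 0
    acc = np.zeros_like(x)            # sum_i lambda_i * (sigma_f * x_i - g_i)
    lam = 0.5                         # lambda_0
    for k in range(num_iters):
        g = subgrad(x)                # g_k, a subgradient of f at x_k
        S += lam                      # S_k
        acc += lam * (sigma_f * x - g)
        # phi_k is a quadratic with Hessian S_k * sigma_f * I, so its minimizer
        # over Q is the projection of its unconstrained minimizer.
        z = project(acc / (sigma_f * S))              # z_k
        lam_next = (k + 2) / 2.0                      # lambda_{k+1}
        S_next = S + lam_next                         # S_{k+1}
        if modified:
            x = (S * x_hat + lam_next * z) / S_next   # x_{k+1} = \hat{x}_{k+1}
            x_hat = x.copy()
        else:
            x = z                                     # x_{k+1} = z_k
            x_hat = (S * x_hat + lam_next * x) / S_next
        lam = lam_next
    return x_hat

# Hypothetical test problem: f(x) = ||x - c||_1 + (sigma_f/2)*||x||^2 on Q = [-1, 1]^3.
c = np.array([0.5, -0.3, 0.0])
sigma_f = 1.0

def f(x):
    return np.abs(x - c).sum() + 0.5 * sigma_f * (x @ x)

def subgrad(x):
    return np.sign(x - c) + sigma_f * x

def project(x):
    return np.clip(x, -1.0, 1.0)

# On Q every component of subgrad(x) lies in [-2, 2], so ||g||^2 <= 12.
M_SQ = 12.0
```

For this objective the minimizer is $x^* = c$, so the gap $f(\hat{x}_K) - f(c)$ can be compared directly with the bounds that the paper derives later for these parameter choices.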
Method 3.2 (Gradient-based methods for structured problems in the unifying framework). Consider a structured problem in the class $\mathcal{SP}(m_f, \sigma_f, \bar{\sigma}_f, L, \delta)$. Let $\{\lambda_k\}_{k \ge 0}$ and $\{\beta_k\}_{k \ge -1}$ be sequences of weight and scaling parameters, respectively. Generate a sequence $\{(z_{k-1}, w_{k-1}, x_k, \hat{x}_k)\}_{k \ge 0}$ by either the classical or the modified method as follows.

(0) Set $x_0 := z_{-1} := w_{-1} := \arg\min_{x \in Q} d(x)$. Compute

$$z_0 := \arg\min_{x \in Q} \varphi_0(x), \qquad \hat{x}_0 := w_0 := \arg\min_{x \in Q} \psi_0(x).$$

(1) ($k$-th iteration, $k \ge 0$) Set

$$x_{k+1} := \begin{cases} z_k & \text{(Classical method)}, \\ \dfrac{S_k \hat{x}_k + \lambda_{k+1} z_k}{S_{k+1}} & \text{(Modified method)}, \end{cases}$$

$$z_{k+1} := \arg\min_{x \in Q} \varphi_{k+1}(x), \qquad w_{k+1} := \arg\min_{x \in Q} \psi_{k+1}(x), \qquad \hat{x}_{k+1} := \frac{S_k \hat{x}_k + \lambda_{k+1} w_{k+1}}{S_{k+1}}.$$

Here, $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ is a coupled sequence of auxiliary functions satisfying Property B.

The implementation of these methods requires a more specific construction of the auxiliary functions $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$, as we will see next.


3.3 Construction of auxiliary functions in the unifying framework 

Here we provide some formulas to construct a coupled sequence $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ of auxiliary functions which admits Property B. For that, we first construct a single sequence of auxiliary functions $\{\varphi_k(x)\}_{k \ge -1}$ satisfying Property A.

Theorem 3.3. Given the weight parameters $\{\lambda_k\}_{k \ge 0}$, the scaling parameters $\{\beta_k\}_{k \ge -1}$, the test points $\{x_k\}_{k \ge 0}$, and a convexity parameter $\sigma_f \ge 0$ for some lower approximation model $m_f(y; x)$ of $f(x)$, construct the sequence $\{\varphi_k(x)\}_{k \ge -1}$ of auxiliary functions as follows. Set $\varphi_{-1}(x) := \beta_{-1} d(x)$, $z_{-1} := x_0$ and, for $k \ge -1$, define

$$\varphi_{k+1}(x) := \varphi_k(z_k) + \lambda_{k+1} m_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x) + S_k \sigma_f \xi(z_k, x) \tag{16}$$

or

$$\varphi_{k+1}(x) := \varphi_k(x) + \lambda_{k+1} m_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k d(x). \tag{17}$$

Then, the sequence $\{\varphi_k(x)\}_{k \ge -1}$ satisfies Property A.

The assumption $z_{-1} := x_0$ is automatically satisfied whenever $\beta_{-1} > 0$ because $\min_{x \in Q} d(x) = d(x_0) = 0$, but it must be imposed when $\beta_{-1} = 0$; in both cases, the condition (A1) holds. To prove Theorem 3.3, it remains to show (A2) and (A3), which will be done in Lemmas 3.6 and 3.7, respectively.

The following theorem is a simple consequence of Theorem 3.3.

Theorem 3.4. Let $\{\varphi_k(x)\}_{k \ge -1}$ be generated according to the construction in Theorem 3.3 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, test points $\{x_k\}_{k \ge 0}$, and a convexity parameter $\sigma_f \ge 0$ for some lower approximation model $m_f(y; x)$ of $f(x)$. Define $\{\psi_k(x)\}_{k \ge -1}$ by $\psi_{-1}(x) := \varphi_{-1}(x)$ and

$$\psi_{k+1}(x) := \varphi_k(z_k) + \lambda_{k+1} m_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x) + S_k \sigma_f \xi(z_k, x). \tag{18}$$

Then, the sequence $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ satisfies Property B.

Proof. Notice that (18) satisfies the condition (B2) with equality. The condition (B1) is immediate from the condition (A1) for $\{\varphi_k(x)\}$ and the definition $\psi_{-1}(x) := \varphi_{-1}(x)$. Since (18) coincides with the right-hand side of (A2) for $\{\varphi_k(x)\}$, the condition (B0) is clear. Finally, the condition (B3) follows from (B0) and (A3) for $\{\varphi_k(x)\}$. □

Before proving Theorem 3.3, let us see some particular constructions of auxiliary functions, which will be useful for the comparison with some existing methods.

• Extended MD model. Define $\{\varphi_k(x)\}_{k \ge -1}$ by $\varphi_{-1}(x) := \beta_{-1} d(x)$ and

$$\varphi_{k+1}(x) := \varphi_k(z_k) + \lambda_{k+1} m_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x) + S_k \sigma_f \xi(z_k, x) \tag{19}$$

for $k \ge -1$. Then, Property A follows from Theorem 3.3 with the update (16).

• DA model. Define $\{\varphi_k(x)\}_{k \ge -1}$ by $\varphi_{-1}(x) := \beta_{-1} d(x)$ and

$$\varphi_k(x) := \sum_{i=0}^{k} \lambda_i m_f(x_i; x) + \beta_k d(x) \tag{20}$$

for $k \ge -1$. Then, Property A follows from Theorem 3.3 with the update (17).

• Hybrid model. Define $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ by $\psi_{-1}(x) := \beta_{-1} d(x)$ and

$$\varphi_k(x) := \sum_{i=0}^{k} \lambda_i m_f(x_i; x) + \beta_k d(x),$$
$$\psi_{k+1}(x) := \min_{z \in Q} \varphi_k(z) + \lambda_{k+1} m_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x) + S_k \sigma_f \xi(z_k, x) \tag{21}$$

for $k \ge -1$. Then, Property B follows from Theorem 3.4 with the update (18).

Consequently, Method 3.1 provides at least four particularizations; we can choose the classical or the modified updates, combined with the choice of the auxiliary functions constructed by the extended MD model (19) or by the DA model (20) (or an arbitrary combination of them). Notice that the subproblems $z_k := \arg\min_{x \in Q} \varphi_k(x)$ in these particularizations can be solved in the form (5).

Method 3.2 yields at least six particularizations due to the additional choice of the hybrid model (21). However, employing the models (19) or (20) in Method 3.2 reduces the number of subproblems at each iteration since $\psi_k(x) = \varphi_k(x)$. Note that only the extended MD model (19) turns the subproblem $z_k := \arg\min_{x \in Q} \varphi_k(x)$ into the form (7); the others require the solution of the subproblem (11). However, the subproblems with these models have the same difficulty for all the examples cited in Section 2.3.

We remark that Theorems 3.3 and 3.4 give infinitely many ways of constructing $\{(\varphi_k(x), \psi_k(x))\}$ because we can mix the updates (16) and (17) in any order.

3.4 Proof of Theorem 3.3

Now let us complete the proof of Theorem 3.3.

Lemma 3.5. Let $\{\varphi_k(x)\}_{k \ge -1}$ be generated according to the construction in Theorem 3.3 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, test points $\{x_k\}_{k \ge 0}$, and a convexity parameter $\sigma_f \ge 0$ for some lower approximation model $m_f(y; x)$ of $f(x)$. Then, for every $k \ge -1$, we have

$$\varphi_k(x) \ge \varphi_k(z_k) + (\beta_k + S_k \sigma_f)\, \xi(z_k, x), \qquad \forall x \in Q.$$

Proof. Since $\sigma_f \in \sigma(m_f(x_i; \cdot))$ for $i \ge 0$, we can see inductively that $\beta_k + S_k \sigma_f \in \sigma(\varphi_k)$ for all $k \ge -1$. Therefore, using its characterization (2), the first-order optimality condition at the minimizer $z_k = \arg\min_{x \in Q} \varphi_k(x)$, which holds for all $x \in Q$, yields the conclusion. □

Lemma 3.6. Let $\{\varphi_k(x)\}_{k \ge -1}$ be generated according to the construction in Theorem 3.3 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, test points $\{x_k\}_{k \ge 0}$, and a convexity parameter $\sigma_f \ge 0$ for some lower approximation model $m_f(y; x)$ of $f(x)$. Then, the condition (A2) holds.

Proof. Notice that the construction (16) satisfies (A2) with equality. In the case of the construction (17), Lemma 3.5 yields for any $x \in Q$ that

$$\begin{aligned}
\varphi_{k+1}(x) &= \varphi_k(x) + \lambda_{k+1} m_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k d(x) \\
&\ge \left[\varphi_k(z_k) + (\beta_k + S_k \sigma_f)\, \xi(z_k, x)\right] + \lambda_{k+1} m_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k d(x) \\
&= \varphi_k(z_k) + \lambda_{k+1} m_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x) + S_k \sigma_f \xi(z_k, x),
\end{aligned}$$

which is the condition (A2) for $k \ge -1$. □
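The last equality in the proof of Lemma 3.6 rests on a one-line identity. Assuming the usual definitions $l_d(z; x) := d(z) + \langle \nabla d(z), x - z \rangle$ and $\xi(z, x) := d(x) - l_d(z; x)$ (taken here to be the ones introduced earlier in the paper; they are not restated in this section), the step can be spelled out as follows.

```latex
\begin{aligned}
(\beta_k + S_k\sigma_f)\,\xi(z_k,x) - \beta_k d(x)
  &= \beta_k\bigl[d(x) - l_d(z_k;x)\bigr] + S_k\sigma_f\,\xi(z_k,x) - \beta_k d(x) \\
  &= -\beta_k\, l_d(z_k;x) + S_k\sigma_f\,\xi(z_k,x).
\end{aligned}
```

The same identity is what turns $\beta_k d(x) - (\beta_k + S_k \sigma_f)\xi(z_k, x)$ into $\beta_k l_d(z_k; x) - S_k \sigma_f \xi(z_k, x)$ in the proof of (A3).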

Lemma 3.7. Let $\{\varphi_k(x)\}_{k \ge -1}$ be generated according to the construction in Theorem 3.3 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, test points $\{x_k\}_{k \ge 0}$, and a convexity parameter $\sigma_f \ge 0$ for some lower approximation model $m_f(y; x)$ of $f(x)$. Then, the condition (A3) holds.

Proof. We prove the assertion by induction. Since $z_{-1} = x_0 = \arg\min_{x \in Q} d(x)$, we have $\min_{x \in Q} l_d(z_{-1}; x) = \min_{x \in Q} d(x) = 0$, which proves (A3) for $k = -1$. Assume that (A3) holds up to $k \ge -1$. In the case when all $\{\varphi_i(x)\}_{i=0}^{k}$ are constructed by (17), $\varphi_k(x)$ coincides with the formula (20). Therefore, Lemma 3.5 implies that

$$\begin{aligned}
\varphi_k(z_k) &\le \varphi_k(x) - (\beta_k + S_k \sigma_f)\, \xi(z_k, x) = \sum_{i=0}^{k} \lambda_i m_f(x_i; x) + \beta_k d(x) - (\beta_k + S_k \sigma_f)\, \xi(z_k, x) \\
&= \sum_{i=0}^{k} \lambda_i m_f(x_i; x) + \beta_k l_d(z_k; x) - S_k \sigma_f \xi(z_k, x)
\end{aligned}$$

for every $x \in Q$, from which the condition (A3) follows. If this is not the case, there exists some integer $j \le k$ such that $\varphi_{k+1}(x)$ is constructed by defining $\varphi_{j+1}(x)$ by (16) and $\varphi_{j+2}(x), \ldots, \varphi_{k+1}(x)$ by (17). Then, we have

$$\varphi_{k+1}(x) = \min_{z \in Q} \varphi_j(z) + \sum_{i=j+1}^{k+1} \lambda_i m_f(x_i; x) + \beta_{k+1} d(x) - \beta_j l_d(z_j; x) + S_j \sigma_f \xi(z_j, x),$$

which yields $\varphi_{k+1}(x) \le \sum_{i=0}^{k+1} \lambda_i m_f(x_i; x) + \beta_{k+1} d(x)$ by the induction hypothesis (A3) for $\varphi_j(x)$. Therefore, Lemma 3.5 implies for every $x \in Q$ that

$$\begin{aligned}
\varphi_{k+1}(z_{k+1}) &\le \varphi_{k+1}(x) - (\beta_{k+1} + S_{k+1} \sigma_f)\, \xi(z_{k+1}, x) \\
&\le \sum_{i=0}^{k+1} \lambda_i m_f(x_i; x) + \beta_{k+1} d(x) - (\beta_{k+1} + S_{k+1} \sigma_f)\, \xi(z_{k+1}, x) \\
&= \sum_{i=0}^{k+1} \lambda_i m_f(x_i; x) + \beta_{k+1} l_d(z_{k+1}; x) - S_{k+1} \sigma_f \xi(z_{k+1}, x),
\end{aligned}$$

which gives the condition (A3) for $\varphi_{k+1}(x)$. □
4 General convergence estimates of subgradient-based methods in the unifying framework

In this section we show general efficiency estimates of Methods 3.1 and 3.2 for the non-smooth and for the structured problems, respectively. We then use the results of this section to derive particular convergence rates for these methods in Section 5.

Note that in general the classical and the modified methods in Methods 3.1 and 3.2 provide different convergence rates. They yield the same convergence rate for non-smooth problems, but the modified method gives much better efficiency than the classical method for smooth problems, as discussed in Section 5.

The following theorems show general estimates for Methods 3.1 and 3.2, which will be proved in the remainder of this section.


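Theorem 4.1. Consider a non-smooth problem in the class $\mathcal{NSP}(g, \sigma_f)$. Let $\{(z_{k-1}, x_k, g_k, \hat{x}_k)\}_{k \ge 0}$ be generated by Method 3.1 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$ and scaling parameters $\{\beta_k\}_{k \ge -1}$. Then, for every $k \ge 0$, the estimate

$$f(\hat{x}_k) - f(x^*) + \sigma_f \xi(z_k, x^*) \le \frac{\beta_k l_d(z_k; x^*) + C_k}{S_k} \tag{22}$$

holds, where

$$C_k := \begin{cases}
\dfrac{1}{2\sigma_d} \displaystyle\sum_{i=0}^{k} \dfrac{\lambda_i^2}{\beta_{i-1} + S_i \sigma_f}\, \|g_i\|_*^2 & \text{for the classical method; and} \\[2ex]
\dfrac{1}{2\sigma_d} \displaystyle\sum_{i=0}^{k} \dfrac{\lambda_i^2 S_i}{\lambda_i^2 \sigma_f + S_i(\beta_{i-1} + S_{i-1} \sigma_f)}\, \|g_i\|_*^2 & \text{for the modified method.}
\end{cases} \tag{23}$$

Furthermore, for every $k \ge 0$, the above estimate holds even when the left-hand side is replaced by $\frac{1}{S_k} \sum_{i=0}^{k} \lambda_i f(x_i) - f(x^*) + \sigma_f \xi(z_k, x^*)$ or by $\min_{0 \le i \le k} f(x_i) - f(x^*) + \sigma_f \xi(z_k, x^*)$ for the classical method.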


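Theorem 4.2. Consider a structured problem in the class $\mathcal{SP}(m_f, \sigma_f, \bar{\sigma}_f, L, \delta)$. Let $\{(z_{k-1}, w_{k-1}, x_k, \hat{x}_k)\}_{k \ge 0}$ be generated by Method 3.2 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$ and scaling parameters $\{\beta_k\}_{k \ge -1}$. Then, for every $k \ge 0$, the estimate

$$f(\hat{x}_k) - f(x^*) + \sigma_f \xi(z_k, x^*) \le \frac{\beta_k l_d(z_k; x^*) + C_k}{S_k} \tag{24}$$

holds, where

$$C_k := \begin{cases}
\dfrac{1}{2} \displaystyle\sum_{i=0}^{k} \lambda_i \left( L(x_i) - \sigma_d \left( \bar{\sigma}_f + \dfrac{\beta_{i-1} + S_{i-1} \sigma_f}{\lambda_i} \right) \right) \|w_i - x_i\|^2 + \displaystyle\sum_{i=0}^{k} \lambda_i \delta(x_i, w_i) & \text{for the classical method; and} \\[2ex]
\dfrac{1}{2} \displaystyle\sum_{i=0}^{k} S_i \left( L(x_i) - \sigma_d \left( \bar{\sigma}_f + \dfrac{S_i(\beta_{i-1} + S_{i-1} \sigma_f)}{\lambda_i^2} \right) \right) \|\hat{x}_i - x_i\|^2 + \displaystyle\sum_{i=0}^{k} S_i \delta(x_i, \hat{x}_i) & \text{for the modified method.}
\end{cases} \tag{25}$$

Furthermore, for every $k \ge 0$, the above estimate holds even when the left-hand side is replaced by $\frac{1}{S_k} \sum_{i=0}^{k} \lambda_i f(w_i) - f(x^*) + \sigma_f \xi(z_k, x^*)$ or by $\min_{0 \le i \le k} f(w_i) - f(x^*) + \sigma_f \xi(z_k, x^*)$ for the classical method.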


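Remark 4.3. Method 3.2 with $\sigma_f = \bar{\sigma}_f = 0$ and $\beta_k = 0$ yields several versions of CGMs because the constructed auxiliary functions are non-negative linear combinations of constants and $\{m_f(x_i; x)\}_{i=0}^{k}$. In this case, Theorem 4.2 implies that the modified method ensures

$$f(\hat{x}_k) - f(x^*) \le \frac{C_k}{S_k} \le \frac{\mathrm{Diam}(Q)^2}{2 S_k} \sum_{i=0}^{k} \frac{\lambda_i^2}{S_i} L(x_i) + \frac{1}{S_k} \sum_{i=0}^{k} S_i \delta(x_i, \hat{x}_i) \tag{26}$$

for all $k \ge 0$, because $\|\hat{x}_i - x_i\|^2 = \frac{\lambda_i^2}{S_i^2} \|w_i - z_{i-1}\|^2 \le \frac{\lambda_i^2}{S_i^2} \mathrm{Diam}(Q)^2$. Note that, if $m_f(y; \cdot)$ is affine for each $y \in Q$, then the classical CGM (15) with $\tau_k := \lambda_{k+1}/S_{k+1}$ and $\hat{x}_k := x_k$ also admits a similar estimate:

$$f(x_k) - f(x^*) \le \frac{\lambda_0 [f(x_0) - m_f(x_0; z_0)]}{S_k} + \frac{\mathrm{Diam}(Q)^2}{2 S_k} \sum_{i=1}^{k} \frac{\lambda_i^2}{S_i} L(x_{i-1}) + \frac{1}{S_k} \sum_{i=1}^{k} S_i \delta(x_{i-1}, x_i). \tag{27}$$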


4.1 Key strategy of the proof 

Under the assumptions of Theorems 4.1 or 4.2, we will prove by induction that the relation

$$(R_k) \qquad S_k f(\hat{x}_k) \le \psi_k(w_k) + C_k$$

holds for every $k \ge 0$, which is used to prove the estimates (22) and (24). Furthermore, the relations

$$(P_k) \quad \sum_{i=0}^{k} \lambda_i f(x_i) \le \psi_k(w_k) + C_k \qquad \text{and} \qquad (Q_k) \quad \sum_{i=0}^{k} \lambda_i f(w_i) \le \psi_k(w_k) + C_k$$

are also useful to prove the latter assertions of Theorems 4.1 and 4.2, respectively.

These relations yield the following estimate.

Lemma 4.4. Suppose that a sequence $\{\hat{x}_k\}_{k \ge 0} \subset Q$ satisfies the relation $(R_k)$ for a coupled sequence $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ of auxiliary functions associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. If the condition (B3) in Property B holds for a convexity parameter $\sigma_f \ge 0$ and for some lower approximation model $m_f(y; x)$ of $f(x)$, then we have

$$f(\hat{x}_k) - f(x) + \sigma_f \xi(z_k, x) \le \frac{\beta_k l_d(z_k; x) + C_k}{S_k}, \qquad \forall x \in Q. \tag{28}$$

Proof. The assertion follows from the condition (B3) and the relation $(R_k)$: for any $x \in Q$, we have

$$S_k f(\hat{x}_k) \le \sum_{i=0}^{k} \lambda_i m_f(x_i; x) + \beta_k l_d(z_k; x) - S_k \sigma_f \xi(z_k, x) + C_k \le S_k f(x) + \beta_k l_d(z_k; x) - S_k \sigma_f \xi(z_k, x) + C_k.$$

□


Remark 4.5. (1) Analogues of Lemma 4.4 easily show that $(P_k)$ and (B3) imply the inequality

$$\min_{0 \le i \le k} f(x_i) - f(x) + \sigma_f \xi(z_k, x) \le \frac{1}{S_k} \sum_{i=0}^{k} \lambda_i f(x_i) - f(x) + \sigma_f \xi(z_k, x) \le \frac{\beta_k l_d(z_k; x) + C_k}{S_k}$$

for $x \in Q$. The conditions $(Q_k)$ and (B3) yield the same inequality with $x_i$ replaced by $w_i$.

(2) When $\sigma_f > 0$, (28) provides bounds for the distances to $x^*$ from $\hat{x}_k$ and $z_k$: according to the facts $f(x) - f(x^*) \ge \sigma_f \xi(x^*, x)$ and $\xi(x, y) \ge \frac{\sigma_d}{2} \|x - y\|^2$ for $x, y \in Q$, the bound (28) implies

$$\min\{\|\hat{x}_k - x^*\|^2, \|z_k - x^*\|^2\} \le \frac{1}{2} \|\hat{x}_k - x^*\|^2 + \frac{1}{2} \|z_k - x^*\|^2 \le \frac{\beta_k l_d(z_k; x^*) + C_k}{\sigma_f \sigma_d S_k}.$$

Lemma 4.4 and Remark 4.5 (1) show that, in order to complete the proofs of Theorems 4.1 and 4.2, it suffices to prove $(R_k)$ and its variants $(P_k)$ or $(Q_k)$. We now turn to their inductive proof.

"'The proof of [121 Theorem 5 . 3 ] replacing the notation {h{-), \h.^.i,Xk+i, Lk+i, Sk+i, Ok+i, I3k+i,cik) of m by 
{—/(■), Xk,Zk,L{xk),S{xk, Xk+i),Tk, Sk/Xo,Xk/Xo) for fe > 0 shows the desired estimate because showing the result 
uses the assumption [161 e<l-(52)] with {L,5) = {Lk+i,5k+i) only at (A, A) = (Afe+ 2 ,Afc+i), which corresponds to our 
assumption ® at {x,y) = {xk,Xk+i)- 
4.2 Validity of $(R_k)$, $(P_k)$, and $(Q_k)$ when $k = 0$

We start the induction with the case $k = 0$. Note that the assumptions of (i) and (ii) in the following lemma are exactly the situations of the initialization step (0) in Methods 3.1 and 3.2, respectively.


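Lemma 4.6. (i) Consider a non-smooth problem in the class $\mathcal{NSP}(g, \sigma_f)$ and let $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ be a coupled sequence of auxiliary functions satisfying Property B associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Then, the relation $(R_0) \equiv (P_0)$ is satisfied with $\hat{x}_0 := x_0$ and

$$C_0 := \frac{1}{2} \frac{\lambda_0^2}{\sigma_d(\lambda_0 \sigma_f + \beta_{-1})} \|g_0\|_*^2. \tag{29}$$

(ii) Consider a structured problem in the class $\mathcal{SP}(m_f, \sigma_f, \bar{\sigma}_f, L, \delta)$ and let $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ be a coupled sequence of auxiliary functions satisfying Property B associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Then, the relation $(R_0) \equiv (Q_0)$ is satisfied with $\hat{x}_0 := w_0$ and

$$C_0 := \frac{\lambda_0}{2} \left( L(x_0) - \sigma_d \left( \bar{\sigma}_f + \frac{\beta_{-1}}{\lambda_0} \right) \right) \|w_0 - x_0\|^2 + \lambda_0 \delta(x_0, \hat{x}_0). \tag{30}$$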

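Proof. Note that (B0) implies that $\varphi_k(z_k) = \min_{x \in Q} \varphi_k(x) \ge \min_{x \in Q} \psi_k(x) = \psi_k(w_k)$. Since $\{\beta_k\}$ is non-decreasing, using (B2) with $x = w_{k+1}$ yields that

$$\begin{aligned}
\psi_{k+1}(w_{k+1}) &\ge \varphi_k(z_k) + \lambda_{k+1} m_f(x_{k+1}; w_{k+1}) + (\beta_k + S_k \sigma_f)\, \xi(z_k, w_{k+1}) \\
&\ge \psi_k(w_k) + \lambda_{k+1} m_f(x_{k+1}; w_{k+1}) + (\beta_k + S_k \sigma_f)\, \xi(z_k, w_{k+1})
\end{aligned} \tag{31}$$

for every $k \ge -1$. In the case $k = -1$, the conditions (B1), $S_{-1} = 0$, and $z_{-1} = x_0$ lead (31) to

$$\begin{aligned}
\psi_0(w_0) &\ge \lambda_0 \left[ m_f(x_0; w_0) - a\, \xi(x_0, w_0) + \left( a + \frac{\beta_{-1}}{\lambda_0} \right) \xi(x_0, w_0) \right] \\
&\ge \lambda_0 \left[ m_f(x_0; w_0) - a\, \xi(x_0, w_0) + \frac{\sigma_d}{2} \left( a + \frac{\beta_{-1}}{\lambda_0} \right) \|w_0 - x_0\|^2 \right]
\end{aligned} \tag{32}$$

for any $a \ge 0$. Let us first show (ii). Letting $a := \bar{\sigma}_f$, the setting $\hat{x}_0 = w_0$ and (30) yield

$$\psi_0(w_0) + C_0 \ge \lambda_0 \left[ m_f(x_0; \hat{x}_0) - \bar{\sigma}_f \xi(x_0, \hat{x}_0) + \frac{L(x_0)}{2} \|\hat{x}_0 - x_0\|^2 + \delta(x_0, \hat{x}_0) \right] \ge \lambda_0 f(\hat{x}_0),$$

which proves the relation $(R_0) \equiv (Q_0)$.

It remains to prove (i). By the definition of $m_f(\cdot\,; \cdot)$ for the non-smooth case, the inequality (32) with $a := \sigma_f$ implies

$$\begin{aligned}
\psi_0(w_0) &\ge \lambda_0 \left[ f(x_0) + \langle g_0, w_0 - x_0 \rangle + \frac{\sigma_d}{2} \left( \sigma_f + \frac{\beta_{-1}}{\lambda_0} \right) \|w_0 - x_0\|^2 \right] \\
&= \lambda_0 f(x_0) + \langle \lambda_0 g_0, w_0 - x_0 \rangle + \frac{\sigma_d}{2} (\lambda_0 \sigma_f + \beta_{-1}) \|w_0 - x_0\|^2 \\
&\ge \lambda_0 f(x_0) - \frac{1}{2} \frac{\lambda_0^2}{\sigma_d(\lambda_0 \sigma_f + \beta_{-1})} \|g_0\|_*^2,
\end{aligned}$$

where the last inequality is due to the basic fact

$$\frac{1}{2} \|x\|^2 + \frac{1}{2} \|s\|_*^2 \ge \langle s, x \rangle \quad \text{for } x \in E,\ s \in E^*. \tag{33}$$

This means that the relation $(R_0) \equiv (P_0)$ is satisfied with the setting $\hat{x}_0 = x_0$ and (29).

□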
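The way the basic fact (33) is applied in the proof of Lemma 4.6 involves a rescaling that is left implicit. As a sketch, write $c := \sigma_d(\lambda_0 \sigma_f + \beta_{-1})$ and assume $c > 0$ (this shorthand is introduced only here); applying (33) to the pair $x := \sqrt{c}\,(w_0 - x_0)$ and $s := -\lambda_0 g_0 / \sqrt{c}$ gives the last inequality of part (i).

```latex
\begin{aligned}
\frac{c}{2}\,\|w_0 - x_0\|^2 + \frac{\lambda_0^2}{2c}\,\|g_0\|_*^2
  &\ge \bigl\langle -\lambda_0 g_0,\; w_0 - x_0 \bigr\rangle \\
\Longrightarrow\quad
\langle \lambda_0 g_0,\; w_0 - x_0 \rangle + \frac{c}{2}\,\|w_0 - x_0\|^2
  &\ge -\frac{1}{2}\,\frac{\lambda_0^2}{\sigma_d(\lambda_0\sigma_f + \beta_{-1})}\,\|g_0\|_*^2 .
\end{aligned}
```

The same rescaling, with $c$ replaced by the corresponding coefficient of the squared norm, produces the gradient terms of $C_{k+1}$ in the inductive steps for the non-smooth case.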
4.3 Validity of $(R_k)$, $(P_k)$, and $(Q_k)$ for the classical method when $k \ge 0$

Let us complete our induction for the classical method. The items (i) and (ii) in the following lemma correspond to the $k$-th iteration of the classical method in Methods 3.1 and 3.2, respectively.

Lemma 4.7. (i) Consider a non-smooth problem in the class $\mathcal{NSP}(g, \sigma_f)$ and let $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ be a coupled sequence of auxiliary functions satisfying Property B associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Suppose for $k \ge 0$ that the relation $(R_k)$ is satisfied for some $\hat{x}_k \in Q$, $C_k \ge 0$. If the relation $x_{k+1} = z_k$ holds, then the relation $(R_{k+1})$ is satisfied with $\hat{x}_{k+1} := \frac{S_k \hat{x}_k + \lambda_{k+1} x_{k+1}}{S_{k+1}}$ and

$$C_{k+1} := C_k + \frac{1}{2\sigma_d} \frac{\lambda_{k+1}^2}{\beta_k + S_{k+1} \sigma_f} \|g_{k+1}\|_*^2. \tag{34}$$

Furthermore, if $(P_k)$ is satisfied, then so is $(P_{k+1})$ with the same settings of $\hat{x}_{k+1}$ and $C_{k+1}$.

(ii) Consider a structured problem in the class $\mathcal{SP}(m_f, \sigma_f, \bar{\sigma}_f, L, \delta)$ and let $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ be a coupled sequence of auxiliary functions satisfying Property B associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Suppose for $k \ge 0$ that the relation $(R_k)$ is satisfied for some $\hat{x}_k \in Q$, $C_k \ge 0$. If the relation $x_{k+1} = z_k$ holds, then the relation $(R_{k+1})$ is satisfied with $\hat{x}_{k+1} := \frac{S_k \hat{x}_k + \lambda_{k+1} w_{k+1}}{S_{k+1}}$ and

$$C_{k+1} := C_k + \frac{\lambda_{k+1}}{2} \left( L(x_{k+1}) - \sigma_d \left( \bar{\sigma}_f + \frac{\beta_k + S_k \sigma_f}{\lambda_{k+1}} \right) \right) \|w_{k+1} - x_{k+1}\|^2 + \lambda_{k+1} \delta(x_{k+1}, w_{k+1}).$$

Furthermore, if $(Q_k)$ is satisfied, then so is $(Q_{k+1})$ with the same settings of $\hat{x}_{k+1}$ and $C_{k+1}$.
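Proof. Using (31) and the relation $x_{k+1} = z_k$, we have for any $a \ge 0$ that

$$\begin{aligned}
\psi_{k+1}(w_{k+1}) &\ge \psi_k(w_k) + \lambda_{k+1} m_f(x_{k+1}; w_{k+1}) + (\beta_k + S_k \sigma_f)\, \xi(z_k, w_{k+1}) \\
&= \psi_k(w_k) + \lambda_{k+1} \left( \left[ m_f(x_{k+1}; w_{k+1}) - a\, \xi(x_{k+1}, w_{k+1}) \right] + \left( a + \frac{\beta_k + S_k \sigma_f}{\lambda_{k+1}} \right) \xi(x_{k+1}, w_{k+1}) \right) \\
&\ge \psi_k(w_k) + \lambda_{k+1} \left( \left[ m_f(x_{k+1}; w_{k+1}) - a\, \xi(x_{k+1}, w_{k+1}) \right] + \frac{\sigma_d}{2} \left( a + \frac{\beta_k + S_k \sigma_f}{\lambda_{k+1}} \right) \|w_{k+1} - x_{k+1}\|^2 \right).
\end{aligned}$$

For the structured problems, letting $a := \bar{\sigma}_f$ and using the definition of $C_{k+1}$ in (ii) yield that

$$\psi_{k+1}(w_{k+1}) + C_{k+1} \ge \psi_k(w_k) + C_k + \lambda_{k+1} f(w_{k+1}).$$

Using $(R_k)$ and the convexity of $f$, we conclude the relation $(R_{k+1})$; $(Q_{k+1})$ follows by using $(Q_k)$ and the inequality above. Hence, the assertion (ii) is proved.

For the non-smooth problems, on the other hand, we can continue by taking $a := \sigma_f$ as follows.

$$\begin{aligned}
\psi_{k+1}(w_{k+1}) &\ge \psi_k(w_k) + \lambda_{k+1} f(x_{k+1}) + \langle \lambda_{k+1} g_{k+1}, w_{k+1} - x_{k+1} \rangle + \frac{\sigma_d}{2} (\beta_k + S_{k+1} \sigma_f) \|w_{k+1} - x_{k+1}\|^2 \\
&\ge \psi_k(w_k) + \lambda_{k+1} f(x_{k+1}) - \frac{1}{2} \frac{\lambda_{k+1}^2}{\sigma_d(\beta_k + S_{k+1} \sigma_f)} \|g_{k+1}\|_*^2.
\end{aligned}$$

Hence, the definition (34) of $C_{k+1}$ yields that

$$\psi_{k+1}(w_{k+1}) + C_{k+1} \ge \psi_k(w_k) + C_k + \lambda_{k+1} f(x_{k+1}).$$

Now the assertion (i) follows in the same way as (ii).

□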












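4.4 Validity of $(R_k)$ for the modified method when $k \ge 0$

The following lemma completes our induction for the modified method. In a similar manner as Lemma 4.7, the items (i) and (ii) below correspond to the $k$-th iteration of the modified method in Methods 3.1 and 3.2, respectively.

Lemma 4.8. (i) Consider a non-smooth problem in the class $\mathcal{NSP}(g, \sigma_f)$ and let $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ be a coupled sequence of auxiliary functions satisfying Property B associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Suppose for $k \ge 0$ that the relation $(R_k)$ is satisfied for some $\hat{x}_k \in Q$, $C_k \ge 0$. If the relation $x_{k+1} = \frac{S_k \hat{x}_k + \lambda_{k+1} z_k}{S_{k+1}}$ holds, then the relation $(R_{k+1})$ is satisfied with $\hat{x}_{k+1} := x_{k+1}$ and

$$C_{k+1} := C_k + \frac{1}{2\sigma_d} \frac{\lambda_{k+1}^2 S_{k+1}}{\lambda_{k+1}^2 \sigma_f + S_{k+1}(\beta_k + S_k \sigma_f)} \|g_{k+1}\|_*^2. \tag{35}$$

(ii) Consider a structured problem in the class $\mathcal{SP}(m_f, \sigma_f, \bar{\sigma}_f, L, \delta)$ and let $\{(\varphi_k(x), \psi_k(x))\}_{k \ge -1}$ be a coupled sequence of auxiliary functions satisfying Property B associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Suppose for $k \ge 0$ that the relation $(R_k)$ is satisfied for some $\hat{x}_k \in Q$, $C_k \ge 0$. If the relations $x_{k+1} = \frac{S_k \hat{x}_k + \lambda_{k+1} z_k}{S_{k+1}}$ and $\hat{x}_{k+1} = \frac{S_k \hat{x}_k + \lambda_{k+1} w_{k+1}}{S_{k+1}}$ hold, then the relation $(R_{k+1})$ is satisfied with

$$C_{k+1} := C_k + \frac{S_{k+1}}{2} \left( L(x_{k+1}) - \sigma_d \left( \bar{\sigma}_f + \frac{S_{k+1}(\beta_k + S_k \sigma_f)}{\lambda_{k+1}^2} \right) \right) \|\hat{x}_{k+1} - x_{k+1}\|^2 + S_{k+1} \delta(x_{k+1}, \hat{x}_{k+1}). \tag{36}$$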

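Proof. Denote $x'_{k+1} := \frac{S_k \hat{x}_k + \lambda_{k+1} w_{k+1}}{S_{k+1}}$; then $x'_{k+1} - x_{k+1} = \frac{\lambda_{k+1}}{S_{k+1}} (w_{k+1} - z_k)$. Using (31) and the relation $(R_k)$, we have

$$\begin{aligned}
\psi_{k+1}(w_{k+1}) + C_k &\ge \psi_k(w_k) + C_k + \lambda_{k+1} m_f(x_{k+1}; w_{k+1}) + (\beta_k + S_k \sigma_f)\, \xi(z_k, w_{k+1}) \\
&\ge S_k f(\hat{x}_k) + \lambda_{k+1} m_f(x_{k+1}; w_{k+1}) + (\beta_k + S_k \sigma_f)\, \xi(z_k, w_{k+1}) \\
&\ge S_k m_f(x_{k+1}; \hat{x}_k) + \lambda_{k+1} m_f(x_{k+1}; w_{k+1}) + (\beta_k + S_k \sigma_f)\, \xi(z_k, w_{k+1}) \\
&\ge S_{k+1} m_f(x_{k+1}; x'_{k+1}) + (\beta_k + S_k \sigma_f)\, \xi(z_k, w_{k+1}),
\end{aligned} \tag{37}$$

where we used $f(x) \ge m_f(y; x)$, $\forall x, y \in Q$, and the convexity of $m_f(x_{k+1}; \cdot)$ for the last two inequalities. Since $\xi(z_k, w_{k+1}) \ge \frac{\sigma_d}{2} \|w_{k+1} - z_k\|^2 = \frac{\sigma_d}{2} \frac{S_{k+1}^2}{\lambda_{k+1}^2} \|x'_{k+1} - x_{k+1}\|^2$ and

$$\begin{aligned}
m_f(x_{k+1}; x'_{k+1}) &= m_f(x_{k+1}; x'_{k+1}) - a\, \xi(x_{k+1}, x'_{k+1}) + a\, \xi(x_{k+1}, x'_{k+1}) \\
&\ge m_f(x_{k+1}; x'_{k+1}) - a\, \xi(x_{k+1}, x'_{k+1}) + \frac{a \sigma_d}{2} \|x'_{k+1} - x_{k+1}\|^2
\end{aligned}$$

hold for any $a \ge 0$, the inequality (37) implies that

$$\psi_{k+1}(w_{k+1}) + C_k \ge S_{k+1} \left[ m_f(x_{k+1}; x'_{k+1}) - a\, \xi(x_{k+1}, x'_{k+1}) \right] + \frac{\sigma_d}{2} S_{k+1} \left( a + \frac{S_{k+1}(\beta_k + S_k \sigma_f)}{\lambda_{k+1}^2} \right) \|x'_{k+1} - x_{k+1}\|^2. \tag{38}$$

Let us prove (ii) first. Since $\hat{x}_{k+1} = x'_{k+1}$ by the assumption, adding

$$\frac{S_{k+1}}{2} \left( L(x_{k+1}) - \sigma_d \left( \bar{\sigma}_f + \frac{S_{k+1}(\beta_k + S_k \sigma_f)}{\lambda_{k+1}^2} \right) \right) \|\hat{x}_{k+1} - x_{k+1}\|^2 + S_{k+1} \delta(x_{k+1}, \hat{x}_{k+1})$$

to both sides of (38) with $a := \bar{\sigma}_f$ and using the inequality (6) implies the relation $(R_{k+1})$ with the setting (36).

To prove (i), on the other hand, letting $a := \sigma_f$ and using $m_f(x_{k+1}; x'_{k+1}) - \sigma_f \xi(x_{k+1}, x'_{k+1}) = f(x_{k+1}) + \langle g_{k+1}, x'_{k+1} - x_{k+1} \rangle$ leads (38) to

$$\begin{aligned}
\psi_{k+1}(w_{k+1}) + C_k &\ge S_{k+1} f(x_{k+1}) + \langle S_{k+1} g_{k+1}, x'_{k+1} - x_{k+1} \rangle + \frac{\sigma_d}{2} S_{k+1} \left( \sigma_f + \frac{S_{k+1}(\beta_k + S_k \sigma_f)}{\lambda_{k+1}^2} \right) \|x'_{k+1} - x_{k+1}\|^2 \\
&\ge S_{k+1} f(x_{k+1}) - \frac{1}{2} \frac{S_{k+1}}{\sigma_d \left( \sigma_f + \frac{S_{k+1}(\beta_k + S_k \sigma_f)}{\lambda_{k+1}^2} \right)} \|g_{k+1}\|_*^2 \\
&= S_{k+1} f(x_{k+1}) - \frac{1}{2} \frac{\lambda_{k+1}^2 S_{k+1}}{\sigma_d \left( \lambda_{k+1}^2 \sigma_f + S_{k+1}(\beta_k + S_k \sigma_f) \right)} \|g_{k+1}\|_*^2.
\end{aligned}$$

This means that the relation $(R_{k+1})$ is obtained with (35). □

4.5 Proof of Theorems 4.1 and 4.2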


Let us show Theorem 4.1; the proof of Theorem 4.2 is analogous, replacing $(P_k)$ with $(Q_k)$ and the part (i) with (ii) in Lemmas 4.6, 4.7, and 4.8.

By the description of Method 3.1, we can apply part (i) of each of Lemmas 4.6, 4.7, and 4.8 to show that the relation $(R_k)$ holds for every $k \ge 0$ with $C_k$ defined by (23); for the classical method, the relation $(P_k)$ can also be verified. The assertion follows from Lemma 4.4 and its analogue for the relation $(P_k)$ (see Remark 4.5 (1)). □

We remark that the above lemmas justify our choices for the update formulas of $x_k$ and $\hat{x}_k$ in Methods 3.1 and 3.2. In fact, what lies behind the proofs is the satisfaction of the relation $(R_k)$ (or its variants). Therefore, the relation $(R_k)$ is an implicit factor in our unifying framework.


5 Optimal/nearly optimal convergence rates of (sub)gradient-based methods

In this section, we finally give the actual convergence rates for Methods 3.1 and 3.2 based on the general estimates presented in Section 4, and compare these results with the existing ones. Our choices for the weight parameters $\{\lambda_k\}_{k \ge 0}$ and the scaling parameters $\{\beta_k\}_{k \ge -1}$ resemble and extend the existing ones used to compute approximate solutions $\{\hat{x}_k\}_{k \ge 0}$.

For comparison, we summarize the optimal convergence rates for each of the problem classes given in Sections 2.2 and 2.3.1 in Table 1. This table shows the optimal convergence rates of $f(\hat{x}_k) - f(x^*)$ for PGMs applied to non-smooth, smooth, and weakly smooth problems (note that $\sigma_d \sigma_f$ becomes a convexity parameter of $f$ with respect to the norm $\|\cdot\|$; see Section 2.1).

For CGMs applied to weakly smooth problems (the class $C^{1,\rho-1}_M(Q)$, $\rho \in (1,2]$), the convergence rate

/(«)-/(.•) <0(112]^) (39) 

can be achieved using the classical one (15) or some of its variants [40]. This rate is known to be optimal when $\rho = 2$ in the sense of the linear optimization oracle, and nearly optimal otherwise.


We show optimal convergence results of PGMs for the non-smooth problems in the next subsection, for the structured problems with inexact oracle in Sections 5.2 and 5.3, and for the weakly smooth problems in the last subsection, all for the strongly convex cases. Optimal and nearly optimal convergences of CGMs are developed in Sections 5.3 and 5.4.4.

Table 1: Optimal convergence rates of PGMs. Here $\sigma_f \in \sigma(f)$, $k$ is an iteration counter, and $c_1(\cdot)$ and $c_2(\cdot)$ are fixed continuous functions. Refer to examples (i) and (iv) in Section 2.3.1 for the descriptions of smooth and weakly smooth problems, respectively.

problem class / type of convexity | non-strongly convex ($\sigma_f = 0$) | strongly convex ($\sigma_f > 0$)
non-smooth problem with (8) for some $M > 0$ | $O\!\left(M\sqrt{\dfrac{d(x^*)}{\sigma_d k}}\right)$ | $O\!\left(\dfrac{M^2}{\sigma_d \sigma_f k}\right)$
smooth problem $C^{1,1}_L(Q)$ | $O\!\left(\dfrac{L\, d(x^*)}{\sigma_d k^2}\right)$ | $O\!\left(\cdots \exp\!\left(-k\, \cdots\right)\right)$
weakly smooth problem $C^{1,\rho-1}_M(Q)$, $\rho \in [1,2)$ | $O\!\left(c_1(\rho)\, \cdots\, k^{-\frac{3\rho-2}{2}}\right)$ | $O\!\left(c_2(\rho) \left(\cdots\, k^{-(3\rho-2)}\right)^{\cdots}\right)$

All of the convergence rates match the known optimal rates of convergence (except for the classical method for the structured problems).

A noteworthy new result is the attainment of the optimal convergence rate for weakly smooth problems in the strongly convex case with less prior information on the objective function than the existing methods require (Section 5.4.3). In addition, for smooth problems, the obtained convergence rates slightly improve the existing ones (Sections 5.2 and 5.3).

Another consequence is that the existing methods included in our unifying framework can be naturally extended to wider classes of problems. In particular, without using a multistage procedure, the DAM for the non-smooth problems can be extended to the strongly convex case (Section 5.1), and Nesterov's and Tseng's PGMs can be extended to the weakly smooth and/or the strongly convex cases (Sections 5.3 and 5.4).

5.1 Optimal convergence rate for non-smooth problems 

Let us analyze the convergence rate of PGMs yielded from Method 3.1. Recall that Method 3.1 generates a sequence $\{\hat{x}_k\}$ which satisfies the relation $(R_k)$ with $C_k$ defined by (23).

When $\sigma_f = 0$, the definitions of $C_k$ for the classical and the modified methods become the same: $C_k = \frac{1}{2\sigma_d} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2$. This case was analyzed in [22, Corollary 11], which ensures the optimal convergence rate $O\!\left(M\sqrt{d(x^*)/(\sigma_d k)}\right)$ with the advantage that we do not need the values $d(x^*)$ and $M$ in the definition of the parameters $\{\lambda_k\}$ and $\{\beta_k\}$ to achieve $O(1/\sqrt{k})$-convergence.

When $\sigma_f > 0$, note that

$$\frac{\lambda_i^2 S_i}{\lambda_i^2 \sigma_f + S_i(\beta_{i-1} + S_{i-1} \sigma_f)} \ge \frac{\lambda_i^2}{\beta_{i-1} + S_i \sigma_f}$$

holds since $\lambda_i / S_i \le 1$. In this case, theoretically, the classical method ensures a convergence rate no worse than that of its modified counterpart.
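The comparison between the two coefficients of $\|g_i\|_*^2$ in (23) can be checked in two lines. The derivation below is a sketch that uses only $S_i = S_{i-1} + \lambda_i$ and $\lambda_i \le S_i$.

```latex
\begin{aligned}
\frac{\lambda_i^2 S_i}{\lambda_i^2\sigma_f + S_i(\beta_{i-1} + S_{i-1}\sigma_f)}
  &= \frac{\lambda_i^2}{\dfrac{\lambda_i^2}{S_i}\,\sigma_f + \beta_{i-1} + S_{i-1}\sigma_f} \\
  &\ge \frac{\lambda_i^2}{\lambda_i\sigma_f + \beta_{i-1} + S_{i-1}\sigma_f}
   = \frac{\lambda_i^2}{\beta_{i-1} + S_i\sigma_f},
\end{aligned}
```

where the inequality uses $\lambda_i^2 / S_i \le \lambda_i$. Each term of $C_k$ for the modified method therefore dominates the corresponding term for the classical method.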

We give an optimal convergence result with a simple choice for the parameters, $\lambda_k = (k+1)/2$ and $\beta_k = 0$, below. Note that every subproblem $\min_{x \in Q} \varphi_k(x)$ has a unique solution even if $\beta_k = 0$ because $\sigma(\varphi_k) \ni \beta_k + S_k \sigma_f = S_k \sigma_f > 0$ (see the proof of Lemma 3.5).

Theorem 5.1. Consider a non-smooth problem in the class $\mathcal{NSP}(g, \sigma_f)$. Let $\{(z_{k-1}, x_k, g_k, \hat{x}_k)\}_{k \ge 0}$ be generated by Method 3.1 associated with $\lambda_k = (k+1)/2$ and $\beta_k = 0$. Assume that $\sigma_f > 0$ and $\sup_{k \ge 0} \|g_k\|_* \le M_f < +\infty$. Then, we have

$$\max\left\{ f(\hat{x}_k) - f(x^*),\ \min_{0 \le i \le k} f(x_i) - f(x^*) \right\} + \sigma_f \xi(x_{k+1}, x^*) \le \frac{2 M_f^2}{\sigma_d \sigma_f (k+4)}, \qquad \forall k \ge 0$$

with the classical method, and

$$f(\hat{x}_k) - f(x^*) + \sigma_f \xi(z_k, x^*) \le \frac{2 M_f^2}{\sigma_d \sigma_f} \cdot \frac{k + \log k + 3/2}{(k+1)(k+2)} = O\!\left( \frac{M_f^2}{\sigma_d \sigma_f k} \right), \qquad \forall k \ge 1$$

with the modified method.

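Proof. Since $\beta_k = 0$ and $S_k = \frac{(k+1)(k+2)}{4}$, Theorem 4.1 implies the estimate

$$f(\hat{x}_k) - f(x^*) + \sigma_f \xi(z_k, x^*) \le \frac{C_k}{S_k} = \frac{4 C_k}{(k+1)(k+2)}, \qquad \forall k \ge 0 \tag{40}$$

with $C_k$ defined by (23). The classical method also admits the same estimate with $f(\hat{x}_k) - f(x^*)$ replaced by $\min_{0 \le i \le k} f(x_i) - f(x^*)$, and we have

$$C_k = \frac{1}{2\sigma_d} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1} + S_i \sigma_f} \|g_i\|_*^2 \le \frac{M_f^2}{2 \sigma_d \sigma_f} \sum_{i=0}^{k} \frac{\lambda_i^2}{S_i}.$$

Using the inequality

$$\sum_{i=0}^{k} \frac{\lambda_i^2}{S_i} = \sum_{i=0}^{k} \frac{i+1}{i+2} \le \frac{(k+1)(k+2)}{k+4} \tag{41}$$

(see Proposition 7.3 of the cited reference), we obtain the first assertion for the classical method, recalling that $z_k = x_{k+1}$ there.

In the modified method, on the other hand, we have

$$C_k = \frac{1}{2\sigma_d} \sum_{i=0}^{k} \frac{\lambda_i^2 S_i}{\lambda_i^2 \sigma_f + S_i(\beta_{i-1} + S_{i-1} \sigma_f)} \|g_i\|_*^2 \le \frac{M_f^2}{2 \sigma_d \sigma_f} \sum_{i=0}^{k} \frac{(i+1)(i+2)}{i(i+2) + 4}$$

and

$$\sum_{i=0}^{k} \frac{(i+1)(i+2)}{i(i+2) + 4} \le \frac{1}{2} + \sum_{i=1}^{k} \frac{(i+1)(i+2)}{i(i+2)} = \frac{1}{2} + \sum_{i=1}^{k} \left( 1 + \frac{1}{i} \right) \le k + \log k + \frac{3}{2}$$

for all $k \ge 1$, which leads (40) to the second assertion.

□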
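The two elementary sum bounds used in the proof of Theorem 5.1 can be checked numerically. The sketch below evaluates both sums in exact rational arithmetic for $\lambda_i = (i+1)/2$ and $\beta_i = 0$, with the common factor $\sigma_f$ divided out; the helper names `classical_sum` and `modified_sum` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction
import math

def classical_sum(k):
    """Exact value of sum_{i=0}^k lambda_i^2 / S_i for lambda_i = (i+1)/2."""
    total, S = Fraction(0), Fraction(0)
    for i in range(k + 1):
        lam = Fraction(i + 1, 2)
        S += lam                          # S_i = S_{i-1} + lambda_i
        total += lam * lam / S
    return total

def modified_sum(k):
    """Exact value of sum_{i=0}^k lambda_i^2 S_i / (lambda_i^2 + S_i S_{i-1})."""
    total, S_prev = Fraction(0), Fraction(0)
    for i in range(k + 1):
        lam = Fraction(i + 1, 2)
        S = S_prev + lam
        total += lam * lam * S / (lam * lam + S * S_prev)
        S_prev = S
    return total

def classical_bound(k):
    """Right-hand side of (41)."""
    return Fraction((k + 1) * (k + 2), k + 4)

def modified_bound(k):
    """Right-hand side of the bound for the modified method (k >= 1)."""
    return k + math.log(k) + 1.5
```

For instance, `classical_sum(0)` and `classical_bound(0)` are both equal to `Fraction(1, 2)`, so (41) holds with equality at $k = 0$.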


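Note that the choices of parameters $\lambda_k = (k+1)/2$ and $\beta_k = 0$ do not depend on $M_f$ and $\sigma_f$. However, we need $\sigma_f$ when we solve the subproblems. For instance, the classical method with the extended MD model (19) associated with the above parameters becomes

$$\begin{aligned}
x_{k+1} := z_k &:= \arg\min_{x \in Q} \left\{ \lambda_k \left[ f(x_k) + \langle g_k, x - x_k \rangle + \sigma_f \xi(x_k, x) \right] + S_{k-1} \sigma_f \xi(x_k, x) \right\} \\
&= \arg\min_{x \in Q} \left\{ \lambda_k \left[ f(x_k) + \langle g_k, x - x_k \rangle \right] + S_k \sigma_f \xi(x_k, x) \right\} \\
&= \arg\min_{x \in Q} \left\{ \frac{\lambda_k}{S_k \sigma_f} \left[ f(x_k) + \langle g_k, x - x_k \rangle \right] + \xi(x_k, x) \right\} \\
&= \arg\min_{x \in Q} \left\{ \frac{2}{\sigma_f (k+2)} \left[ f(x_k) + \langle g_k, x - x_k \rangle \right] + \xi(x_k, x) \right\}, \\
\hat{x}_k &:= \frac{1}{S_k} \sum_{i=0}^{k} \lambda_i x_i = \frac{2}{(k+1)(k+2)} \sum_{i=0}^{k} (i+1)\, x_i,
\end{aligned}$$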
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which gives the estimates 


max{/(xfc) - /(x*), mino<j<fc/(xj) - /(x*)} + afC{xk+i,x*) 

min{||xfc - x*|p, ||xi(fc) - x*|p, ||xfc+i - x*|p} 

for all A: > 0, where i{k) E ArgminQ<j<;^/(xj) (see Lemma 14.41 and Remark 14.Sp . Notice that the 
computation of Zk is equivalent to the subproblem (IlOp (the extended MD model for non-strongly 
convex case) with := = 1- This result is closely related to [30l Theorem 1], [3l 

Proposition 3.1], and [29l Proposition 2.8]. 

The convergence result (1421) is also valid for the DA model (I20p . and then we conclude that 
a strongly convex version of the DAM achieves the optimal complexity for non-smooth problems 
(see Section . This result is new. Note that we do not exploit the multistage procedure and do 
not require an upper bound of d{x*) to obtain the optimality as required in [25] . 


< 


< 


2M2 


ad<7f{k + 4) 


2Mj 


(42) 


ajajik + A)' 
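The classical method above is easy to instantiate in the Euclidean setting, where $\xi(x_k, x) = \frac{1}{2}\|x - x_k\|_2^2$, $\sigma_d = 1$, and the subproblem reduces to a projected subgradient step with step size $2/(\sigma_f(k+2))$. The following is a minimal sketch under these assumptions; the test function $f(x) = \|x\|_1 + \frac{\sigma_f}{2}\|x - c\|_2^2$, the box constraint, and all function names are illustrative choices, not taken from the paper.

```python
def f(x, c, sigma):
    """Illustrative objective: ||x||_1 + (sigma/2) ||x - c||_2^2."""
    return (sum(abs(v) for v in x)
            + 0.5 * sigma * sum((v - w) ** 2 for v, w in zip(x, c)))


def subgrad(x, c, sigma):
    """One subgradient of f at x (sign(0) is taken as 0)."""
    sign = lambda v: (v > 0) - (v < 0)
    return [sign(v) + sigma * (v - w) for v, w in zip(x, c)]


def project_box(x, radius):
    """Euclidean projection onto Q = [-radius, radius]^n."""
    return [min(max(v, -radius), radius) for v in x]


def classical_method(c, sigma, radius, iters):
    """Run x_{k+1} = P_Q(x_k - 2/(sigma (k+2)) g_k) from x_0 = 0.

    Returns a list of (k, x_hat_k, max_i ||g_i||^2 for i <= k), where
    x_hat_k is the weighted average with weights proportional to i+1.
    """
    n = len(c)
    x = [0.0] * n
    weighted = [0.0] * n
    max_g2 = 0.0
    history = []
    for k in range(iters):
        g = subgrad(x, c, sigma)
        max_g2 = max(max_g2, sum(v * v for v in g))
        weighted = [a + (k + 1) * v for a, v in zip(weighted, x)]
        x_hat = [2.0 * a / ((k + 1) * (k + 2)) for a in weighted]
        history.append((k, x_hat, max_g2))
        step = 2.0 / (sigma * (k + 2))
        x = project_box([v - step * gv for v, gv in zip(x, g)], radius)
    return history
```

For this objective the minimizer is given by soft-thresholding $c$ at level $1/\sigma_f$, so the gap $f(\hat{x}_k) - f(x^*)$ can be compared directly with $2M_f^2/(\sigma_f(k+4))$, taking $M_f$ as the largest subgradient norm observed so far.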


5.2 Convergence rate of the classical method for structured problems with constants L and δ

We next analyze the convergence rate of PGMs produced by Method 3.2 for a particular case of structured problems. Let us consider a structured problem in $\mathcal{SP}(m_f, \sigma_f, \bar\sigma_f, L, \delta)$ for the particular case $L(\cdot) \equiv L > 0$ and $\delta(\cdot, \cdot) \equiv \delta \ge 0$. In this case, we assume that $L > \bar\sigma_f \sigma_d$; notice that, in view of $m_f(y; x) \le f(x)$ and $\xi(y, x) \ge \frac{\sigma_d}{2}\|x - y\|^2$ for $x, y \in Q$, the inequality (6) yields $0 \le \frac{L - \bar\sigma_f \sigma_d}{2}\|y - x\|^2 + \delta$. We firstly show a convergence result of the classical method of Method 3.2 which does not ensure the optimal convergence rate for the class $C^{1,1}_L(Q)$. This rate is at least as good as those of the existing PGMs compared in this subsection.


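Theorem 5.2. Consider a structured problem in the class $\mathcal{SP}(m_f, \sigma_f, \bar\sigma_f, L, \delta)$. Assume additionally that $L(\cdot) \equiv L > 0$, $\delta(\cdot, \cdot) \equiv \delta \ge 0$, and $L > \bar\sigma_f \sigma_d$. Let $\{(z_{k-1}, w_{k-1}, x_k, \hat{x}_k)\}_{k \ge 0}$ be generated by the classical method of Method 3.2 with

$$\beta_k := \frac{L - \bar\sigma_f \sigma_d}{\sigma_d}, \quad \lambda_0 := 1, \quad \lambda_{k+1} := \frac{\beta_k + S_k \sigma_f}{\beta_k}. \tag{43}$$

Then, for every $k \ge 0$, we have

$$f(\hat{x}_k) - f(x^*) + \sigma_f\, \xi(z_k, x^*) \le \frac{L - \bar\sigma_f \sigma_d}{\sigma_d}\, l_d(z_k; x^*) \min\left\{ \left( 1 - \frac{\sigma_f \sigma_d}{L - \bar\sigma_f \sigma_d + \sigma_f \sigma_d} \right)^{k},\ \frac{1}{k+1} \right\} + \delta. \tag{44}$$

Furthermore, the left hand side of (44) can be replaced by $\frac{1}{S_k}\sum_{i=0}^{k} \lambda_i f(w_i) - f(x^*) + \sigma_f\, \xi(z_k, x^*)$ or by $\min_{0 \le i \le k} f(w_i) - f(x^*) + \sigma_f\, \xi(z_k, x^*)$.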

Proof. The classical method admits the relation {Rk) and {Qk) with 


Ck = (l - ad (af -b 11^* “ ^*11^ + ^ AiA 

^ i=0 

The definitions of Xk and (dk implies that Ck = ~ (since = ilT2n) 

and S'fc = l-b Sk-i for all k > 0. Therefore, we have Sk > k-\-l and 5^ > (1-|- = 

(1 ~ jd the result follows from Theorem 14.21 □ 
Notice that the right hand side of (44) goes to $\delta$ as $k \to \infty$.

It is interesting to notice that the particular choice of parameters (43) does not necessarily require the knowledge of $\sigma_f$ and $\bar\sigma_f$ for the implementation of the classical gradient method with the extended MD model (19): for smooth problems (i.e., $f \in C^{1,1}_L(Q)$), for instance, the corresponding subproblem can be rewritten as follows:

$$\begin{aligned}
z_k &:= \operatorname*{argmin}_{x \in Q} \big\{ \lambda_k [ f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \bar\sigma_f\, \xi(x_k, x) ] + \beta_k\, \xi(x_k, x) + S_{k-1} \sigma_f\, \xi(x_k, x) \big\} \\
&= \operatorname*{argmin}_{x \in Q} \Big\{ f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \Big( \bar\sigma_f + \frac{\beta_k + S_{k-1}\sigma_f}{\lambda_k} \Big) \xi(x_k, x) \Big\} \\
&= \operatorname*{argmin}_{x \in Q} \Big\{ f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \frac{L}{\sigma_d}\, \xi(x_k, x) \Big\},
\end{aligned} \tag{45}$$

which requires only $L$; in the Euclidean setting (i.e., $\xi(x_k, x) = \frac{1}{2}\|x_k - x\|_2^2$), furthermore, the Lipschitz condition (6) ensures that $f(x_{k+1}) \le f(x_k)$ because $x_{k+1} = z_k$ is given by (45). The classical gradient method with the DA model (20) and the hybrid model (21), on the other hand, do not possess this advantage.
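In the Euclidean setting ($\sigma_d = 1$), the last subproblem in (45) is the projected gradient step $x_{k+1} = P_Q(x_k - \nabla f(x_k)/L)$, and the monotone decrease $f(x_{k+1}) \le f(x_k)$ can be observed directly. The following is a minimal sketch, assuming an illustrative separable quadratic $f(x) = \sum_j (\frac{1}{2} d_j x_j^2 - b_j x_j)$ over a box, with $L = \max_j d_j$; none of these names or test data come from the paper.

```python
def project_box(x, lo, hi):
    """Euclidean projection onto Q = [lo, hi]^n."""
    return [min(max(v, lo), hi) for v in x]


def run_gradient_method(diag, b, lo, hi, iters):
    """Iterate x_{k+1} = P_Q(x_k - grad f(x_k) / L) for a diagonal quadratic.

    Returns the final iterate and the list of objective values f(x_k).
    """
    L = max(diag)  # Lipschitz constant of the gradient for this f

    def f(x):
        return sum(0.5 * d * v * v - bj * v for d, v, bj in zip(diag, x, b))

    def grad(x):
        return [d * v - bj for d, v, bj in zip(diag, x, b)]

    x = [hi] * len(diag)  # arbitrary feasible starting point
    values = [f(x)]
    for _ in range(iters):
        g = grad(x)
        x = project_box([v - gv / L for v, gv in zip(x, g)], lo, hi)
        values.append(f(x))
    return x, values
```

Only $L$ enters the iteration; neither strong convexity parameter is needed, which is the point made in the text.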

Let us see the corresponding PGMs for other particular structures. 

• Consider the composite problem $\min_{x \in Q}[f(x) = f_0(x) + \Psi(x)]$ as the example (ii) in Section 2.3.1 with the structure $\bar\sigma_f = \sigma_{f_0} = 0$ (and thus $\sigma_f = \sigma_\Psi$) in the Euclidean setting (then, $\sigma_d = 1$). Choosing parameters by (43), the classical gradient methods with the extended MD model and the hybrid model yield the Gradient Method $\mathcal{GM}(x_0, L)$ and the Dual Gradient Method $\mathcal{DG}(x_0, L)$ in [38], respectively (in this case, we do not exploit the procedure to estimate the Lipschitz constant $L$). Then, Theorem 5.2 improves the convergence rates shown in [38] as follows: The linear convergence factor $1 - \frac{\sigma_f}{L + \sigma_f} = \frac{L}{L + \sigma_f}$ provided by (44) is

less than the one in |38l Theorem 5] (because 1 — for any 7 > 1) and 

the same linear convergence is also valid for the method $\mathcal{DG}(x_0, L)$, which is not presented in the paper (the linear convergence for the dual gradient method was first demonstrated in

m)- 

• For the convex problems with inexact oracle model as the example (iii) in Section 2.3.1 in the Euclidean setting (then, $\bar\sigma_f = \sigma_f$, $\sigma_d = 1$), the classical gradient method with the extended MD model and the hybrid model yield the primal and the dual gradient methods in [11], respectively (but the definition (43) of $\{\lambda_k\}$ is slightly different from (4.1) and (4.2) in [11]). Because of $\sigma_d = 1$ and $(L - \sigma_f)\, l_d(z_k; x^*) \le L d(x^*) = \frac{L}{2}\|x_0 - x^*\|_2^2$, the estimate (44) slightly improves Theorems 4 and 5 in [11] (since $\bar\sigma_f = \sigma_f$, the factor of linear convergence is the same).

Note that the classical gradient method of Method 3.2 with the DA model (20) can reduce the subproblems of the dual gradient method from two [11, 38] to one, preserving the same convergence rate.

5.3 Optimal convergence rate of the modified method for structured problems with constants L and δ

The modified method of Method 3.2 for the structured problem in the particular case $L(\cdot) \equiv L > 0$, $\delta(\cdot, \cdot) \equiv \delta \ge 0$ can be analyzed as follows. Differently from the classical method, it achieves the optimal convergence rate for the class $C^{1,1}_L(Q)$. The result below further implies efficient rates for the CGMs, too.
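Theorem 5.3. Consider a structured problem in the class $\mathcal{SP}(m_f, \sigma_f, \bar\sigma_f, L, \delta)$. Assume additionally that $L(\cdot) \equiv L > 0$, $\delta(\cdot, \cdot) \equiv \delta \ge 0$, and $L > \bar\sigma_f \sigma_d$.

(i) Let $\{(z_{k-1}, w_{k-1}, x_k, \hat{x}_k)\}_{k \ge 0}$ be generated by the modified method of Method 3.2 with

$$\beta_k := \frac{L - \bar\sigma_f \sigma_d}{\sigma_d}, \quad \lambda_0 := 1, \quad (L - \bar\sigma_f \sigma_d)\lambda_{k+1}^2 = \sigma_d (S_k \sigma_f + \beta_{k-1})(\lambda_{k+1} + S_k) \quad (k \ge 0) \tag{46}$$

(i.e., $\lambda_{k+1}$ is determined as the largest root of the above quadratic equation). Then, for every $k \ge 0$, we have

$$\begin{aligned}
f(\hat{x}_k) - f(x^*) + \sigma_f\, \xi(z_k, x^*) \le{} & \frac{L - \bar\sigma_f \sigma_d}{\sigma_d}\, l_d(z_k; x^*) \min\left\{ \frac{4}{(k+1)(k+4)},\ \left( 1 + \frac{1}{2}\sqrt{\frac{\sigma_f \sigma_d}{L - \bar\sigma_f \sigma_d}} \right)^{-2k} \right\} \\
& + \min\left\{ \frac{1}{3}k + \frac{1}{6}\log(k+2) + 1,\ 1 + \sqrt{\frac{L - \bar\sigma_f \sigma_d}{\sigma_f \sigma_d}} \right\} \delta.
\end{aligned}$$

(ii) Suppose further that $\sigma_f = 0$ and $Q$ is bounded. Let $\{(z_{k-1}, w_{k-1}, x_k, \hat{x}_k)\}_{k \ge 0}$ be generated by the modified method of Method 3.2 with $\beta_k := 0$, $\lambda_k := (k+1)/2$ as a CGM (refer to Remark 4.3). Then, for every $k \ge 0$, we have

$$f(\hat{x}_k) - f(x^*) \le \frac{2L \max_{0 \le i \le k} \|w_i - z_{i-1}\|^2}{k+4} + \frac{k+3}{3}\delta.$$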


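Proof. By Theorem 4.2 we have the estimate (24) with

$$\begin{aligned}
C_k &= \frac{1}{2}\sum_{i=0}^{k} S_i \left( L(x_i) - \sigma_d \Big( \bar\sigma_f + \frac{S_i(\beta_{i-1} + S_{i-1}\sigma_f)}{\lambda_i^2} \Big) \right) \|\hat{x}_i - x_i\|^2 + \sum_{i=0}^{k} S_i\, \delta(x_i, \hat{x}_i) \\
&= \frac{1}{2}\sum_{i=0}^{k} \frac{\lambda_i^2}{S_i} \left( L - \sigma_d \Big( \bar\sigma_f + \frac{S_i(\beta_{i-1} + S_{i-1}\sigma_f)}{\lambda_i^2} \Big) \right) \|w_i - z_{i-1}\|^2 + \sum_{i=0}^{k} S_i\, \delta.
\end{aligned}$$

(i) Notice that, since $\lambda_{k+1} + S_k = S_{k+1}$, (46) eliminates the above first summation so that we have $C_k = \sum_{i=0}^{k} S_i \delta$. Therefore, using Lemmas A.1 to A.4 given in the Appendix for the analysis of (46), (24) leads to the assertion.

(ii) Letting $\lambda_k = (k+1)/2$, $\beta_k = 0$, and $\sigma_f = 0$ in Theorem 4.2 with $C_k$ described above and using the inequality (41) establish that

$$f(\hat{x}_k) - f(x^*) \le \frac{C_k}{S_k} = \frac{L}{2S_k}\sum_{i=0}^{k} \frac{\lambda_i^2}{S_i}\|w_i - z_{i-1}\|^2 + \frac{\sum_{i=0}^{k} S_i}{S_k}\delta \le \frac{2L \max_{0 \le i \le k}\|w_i - z_{i-1}\|^2}{k+4} + \frac{k+3}{3}\delta.$$

□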
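The coefficient $(k+3)/3$ of $\delta$ in assertion (ii) is the exact value of $\sum_{i=0}^{k} S_i / S_k$ for $S_i = (i+1)(i+2)/4$. A minimal sketch that confirms this identity in exact rational arithmetic (the function names are illustrative):

```python
from fractions import Fraction


def S(i):
    """S_i = (i+1)(i+2)/4, the partial sums of lambda_i = (i+1)/2."""
    return Fraction((i + 1) * (i + 2), 4)


def delta_coefficient(k):
    """Exact value of (S_0 + ... + S_k) / S_k."""
    return sum(S(i) for i in range(k + 1)) / S(k)
```

Since this coefficient grows linearly in $k$, the accumulated oracle error eventually dominates unless $\delta$ is small, in line with the remark that follows.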


When $\delta > 0$, the bounds obtained in Theorem 5.3 (i) and (ii) diverge as $k \to \infty$ unless $\sigma_f > 0$ (strongly convex case) for the assertion (i). Thus, the parameter $\delta > 0$ must be sufficiently small in
order to ensure an approximate solution with a desired precision. One can see further discussions 
on these bounds in mm- 

In the non-strongly convex case $\sigma_f = \bar\sigma_f = 0$, Tseng's PGMs [43] are derived from the modified method with the model (19) or (20) and Nesterov's PGM [43] is derived with the hybrid model (21). From these facts, one can conclude that the first result of Theorem 5.3 yields the strongly convex
versions of Tseng’s and Nesterov’s PGMs with optimal complexity (see m for the verification of 
the optimality). The fast/accelerated gradient method in [111 [T^l38] for strongly convex problems 
are different from these three particularizations of the models (19) to (21).
Let us consider the Euclidean setting $d(x) = \frac{1}{2}\|x - x_0\|_2^2$, $\sigma_d = 1$. The first assertion of Theorem 5.3 applied to the convex problems with inexact oracle model (recall the example (iii) in Section 2.3.1 and the fact that $\bar\sigma_f = \sigma_f$) is slightly better than the estimate [11, Theorem 7] in view
of (L — af)ld{zk;x*) < Ld{x*) and Furthermore, the first assertion applied to the 

composite problems minx£Q[f{x) = fo{x) +\P{x)] (the example (ii) in Section r2.3.1l) is the same 
as Nesterov’s one [38l Theorem 6] with 7^ = 2 (recall that dj = af^ = 0, ct/ = a<p). Therefore, 
Method 13.21 achieves the optimal complexity for smooth and strongly convex problems (see Section 

[QD . 

The second result of Theorem 15.31 matches the conclusion for the classical CGM observed in 
m Section 5.2.1]. If we further assume / G then the corresponding implementation of 

the second assertion with the extended MD model (|19j) and the DA model (I20j) yield particular 
instances of the CGMs proposed by Lan m (see Section [2. 3. 2p . 

5.4 Optimal convergence rates of the modified method for weakly smooth problems

Considering structured problems in the case when $\delta(y, x) = \frac{M(y)}{\rho}\|y - x\|^{\rho}$, $\rho \in [1, 2)$, we can provide

convergence analysis for problems involving weakly smooth functions of the class (see 

examples (iv) and (v) in Section 2.3.1). Note that the smooth case $\rho = 2$ reduces to the situation $\delta(y, x) \equiv 0$ which has been already discussed. In this section, we show convergence results of modified proximal/conditional gradient methods for this setting. In the case $\rho = 1$, the results from Sections 5.4.1 to 5.4.3 can be seen as variants of stochastic gradient methods developed in
mm for the deterministic setting. 


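5.4.1 General convergence estimates of the modified method for weakly smooth problems

Our analysis for proximal gradient methods is based on the following lemma.

Lemma 5.4. Consider a structured problem in the class $\mathcal{SP}(m_f, \sigma_f, \bar\sigma_f, L, \delta)$. Assume that $\delta(y, x) = \frac{M(y)}{\rho}\|y - x\|^{\rho}$, $\rho \in [1, 2)$, $M(\cdot) \ge 0$. Let $\{(z_{k-1}, w_{k-1}, x_k, \hat{x}_k)\}_{k \ge 0}$ be generated by the modified method of Method 3.2 with weight parameters $\{\lambda_k\}_{k \ge 0}$ and scaling parameters $\{\beta_k\}_{k \ge -1}$. Put $a_k := L(x_k) - \sigma_d \Big( \bar\sigma_f + \frac{S_k(\beta_{k-1} + S_{k-1}\sigma_f)}{\lambda_k^2} \Big)$. If $a_i < 0$ for each $0 \le i \le k$, then we have

$$f(\hat{x}_k) - f(x^*) + \sigma_f\, \xi(z_k, x^*) \le \frac{\beta_k\, l_d(z_k; x^*)}{S_k} + \frac{(2 - \rho) \max_{0 \le i \le k} M(x_i)^{\frac{2}{2-\rho}}}{2\rho S_k} \sum_{i=0}^{k} \frac{S_i}{(-a_i)^{\frac{\rho}{2-\rho}}}.$$

Proof. Note that the function $g(r) = a r^2 + b r^{\rho}$ for $r \ge 0$, $a < 0$, $b \ge 0$ satisfies $\max_{r \ge 0} g(r) = \frac{2-\rho}{2\rho}(-2a)^{-\frac{\rho}{2-\rho}}(\rho b)^{\frac{2}{2-\rho}}$. Hence, Theorem 4.2 concludes that

$$\begin{aligned}
f(\hat{x}_k) - f(x^*) + \sigma_f\, \xi(z_k, x^*) &\le \frac{\beta_k\, l_d(z_k; x^*)}{S_k} + \frac{1}{S_k}\sum_{i=0}^{k} S_i \left( \frac{a_i}{2}\|\hat{x}_i - x_i\|^2 + \frac{M(x_i)}{\rho}\|\hat{x}_i - x_i\|^{\rho} \right) \\
&\le \frac{\beta_k\, l_d(z_k; x^*)}{S_k} + \frac{1}{S_k}\sum_{i=0}^{k} S_i\, \frac{2-\rho}{2\rho}\, (-a_i)^{-\frac{\rho}{2-\rho}} M(x_i)^{\frac{2}{2-\rho}},
\end{aligned}$$

which proves the assertion. □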
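The closed-form maximum used in the proof can be sanity-checked numerically. The following is a minimal sketch, assuming the scalar function $g(r) = a r^2 + b r^{\rho}$ with $a < 0$, $b \ge 0$, $\rho \in [1, 2)$ as in the proof; the function names and the sample values of $a$, $b$, $\rho$ in the usage are illustrative.

```python
def g(r, a, b, rho):
    """g(r) = a r^2 + b r^rho for r >= 0."""
    return a * r * r + b * r ** rho


def maximizer(a, b, rho):
    """Stationary point of g on (0, inf): r^(2-rho) = rho b / (-2a)."""
    return (rho * b / (-2.0 * a)) ** (1.0 / (2.0 - rho))


def closed_form_max(a, b, rho):
    """(2-rho)/(2 rho) * (-2a)^(-rho/(2-rho)) * (rho b)^(2/(2-rho))."""
    return ((2.0 - rho) / (2.0 * rho)
            * (-2.0 * a) ** (-rho / (2.0 - rho))
            * (rho * b) ** (2.0 / (2.0 - rho)))
```

Applying this with $a = a_i/2$ and $b = M(x_i)/\rho$ gives the term $\frac{2-\rho}{2\rho}(-a_i)^{-\rho/(2-\rho)} M(x_i)^{2/(2-\rho)}$ appearing in the lemma.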
5.4.2 Optimal convergence rates for the non-strongly convex case

Let us deduce a convergence result of PGMs given by the modified method of Method 3.2 for the non-strongly convex case $\sigma_f = \bar\sigma_f = 0$. The result with $\rho = 1$ is closely related to the deterministic versions of [18, Proposition 8] and [8, Corollary 1].

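Theorem 5.5. Consider a structured problem in the class $\mathcal{SP}(m_f, \sigma_f, \bar\sigma_f, L, \delta)$. Assume additionally that $L(\cdot) \equiv L > 0$, $\sigma_f = \bar\sigma_f = 0$, and $\delta(y, x) = \frac{M(y)}{\rho}\|y - x\|^{\rho}$ for $\rho \in [1, 2)$, $M(\cdot) \ge 0$. Let $\{(z_{k-1}, w_{k-1}, x_k, \hat{x}_k)\}_{k \ge 0}$ be generated by the modified method of Method 3.2 with

$$\lambda_k := \frac{k+1}{2}, \quad \beta_k := \frac{L}{\sigma_d} + \frac{\gamma}{\sigma_d}(k+3)^{\frac{3}{2}(2-\rho)}, \quad \gamma > 0.$$

Then, for every $k \ge 0$, we have

$$f(\hat{x}_k) - f(x^*) \le \frac{4L\, l_d(z_k; x^*)}{\sigma_d (k+1)(k+2)} + \left[ \frac{4\gamma\, l_d(z_k; x^*)}{\sigma_d} + \frac{\max_{0 \le i \le k} M(x_i)^{\frac{2}{2-\rho}}}{3\rho\, \gamma^{\frac{\rho}{2-\rho}}} \right] \frac{(k+3)^{\frac{3}{2}(2-\rho)}}{(k+1)(k+2)}.$$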

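Proof. We apply Lemma 5.4 to prove the assertion. Note that

$$\frac{\beta_k}{S_k} = \frac{4L}{\sigma_d (k+1)(k+2)} + \frac{4\gamma (k+3)^{\frac{3}{2}(2-\rho)}}{\sigma_d (k+1)(k+2)} \tag{47}$$

and $a_k$ in Lemma 5.4 becomes now

$$a_k = -\frac{L}{k+1} - \gamma\, \frac{(k+2)^{\frac{3}{2}(2-\rho)+1}}{k+1} \le -\gamma\, \frac{(k+2)^{\frac{3}{2}(2-\rho)+1}}{k+1} < 0.$$

Furthermore, we have

$$\begin{aligned}
\frac{1}{S_k}\sum_{i=0}^{k} \frac{S_i}{(-a_i)^{\frac{\rho}{2-\rho}}}
&\le \frac{1}{S_k}\sum_{i=0}^{k} \frac{(i+1)^{\frac{\rho}{2-\rho}+1}}{4\gamma^{\frac{\rho}{2-\rho}} (i+2)^{\frac{3}{2}\rho + \frac{\rho}{2-\rho} - 1}}
\le \frac{1}{4\gamma^{\frac{\rho}{2-\rho}} S_k}\sum_{i=0}^{k} (i+2)^{2 - \frac{3}{2}\rho} \\
&\le \frac{1}{4\gamma^{\frac{\rho}{2-\rho}} S_k} \cdot \frac{2}{3(2-\rho)} (k+3)^{3 - \frac{3}{2}\rho}
= \frac{2 (k+3)^{\frac{3}{2}(2-\rho)}}{3(2-\rho)\, \gamma^{\frac{\rho}{2-\rho}} (k+1)(k+2)}
\end{aligned} \tag{48}$$

where the second and the third inequalities are due to $i + 1 \le i + 2$ and the fact $\sum_{i=0}^{k} (i+2)^q \le \frac{1}{q+1}(k+3)^{q+1}$, $\forall q > -1$, respectively. Consequently, the theorem follows by applying Lemma 5.4 with the inequalities (47) and (48). □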


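Notice that we need the parameter $\rho$ to define $\beta_k$ but not the $M(\cdot)$. Now let us calculate an efficient choice for $\gamma$. Suppose that $M(\cdot) \le M < +\infty$. Using $l_d(z_k; x^*) \le d(x^*)$ and the fact that the function $g(\gamma) = a\gamma + b\gamma^{-p}$ ($a, b, p > 0$) attains its minimum at $\gamma^* = (pb/a)^{\frac{1}{p+1}}$ on $(0, \infty)$ with $g(\gamma^*) = (p+1)\, p^{-\frac{p}{p+1}} a^{\frac{p}{p+1}} b^{\frac{1}{p+1}}$, the choice

$$\gamma = \gamma^* = \left( \frac{\rho}{2-\rho} \cdot \frac{M^{\frac{2}{2-\rho}}}{3\rho} \cdot \frac{\sigma_d}{4 d(x^*)} \right)^{\frac{2-\rho}{2}} = M \left( \frac{\sigma_d}{12(2-\rho)\, d(x^*)} \right)^{\frac{2-\rho}{2}}$$

makes the estimate of Theorem 5.5 as follows:

$$\begin{aligned}
f(\hat{x}_k) - f(x^*) &\le \frac{4L d(x^*)}{\sigma_d (k+1)(k+2)} + \frac{2}{2-\rho}\left( \frac{\rho}{2-\rho} \right)^{-\frac{\rho}{2}} \left( \frac{4 d(x^*)}{\sigma_d} \right)^{\frac{\rho}{2}} \left( \frac{M^{\frac{2}{2-\rho}}}{3\rho} \right)^{\frac{2-\rho}{2}} \frac{(k+3)^{\frac{3}{2}(2-\rho)}}{(k+1)(k+2)} \\
&= \frac{4L d(x^*)}{\sigma_d (k+1)(k+2)} + \frac{2 (2\sqrt{3})^{\rho}}{3\rho\, (2-\rho)^{\frac{2-\rho}{2}}}\, M \left( \frac{d(x^*)}{\sigma_d} \right)^{\frac{\rho}{2}} \frac{(k+3)^{\frac{3}{2}(2-\rho)}}{(k+1)(k+2)}.
\end{aligned}$$

Note that $\min_{x > 0} x^x = (1/e)^{1/e}$ and $\max_{\rho \in [1,2]} \frac{2}{3\rho}(2\sqrt{3})^{\rho} = \frac{1}{3}(2\sqrt{3})^2 = 4$ because $\log(2\sqrt{3}) > 1$ implies the positivity of the derivative of $\frac{2}{3\rho}(2\sqrt{3})^{\rho}$. Therefore, we have $\frac{2 (2\sqrt{3})^{\rho}}{3\rho\, (2-\rho)^{\frac{2-\rho}{2}}} \le 4 e^{\frac{1}{2e}}$, which shows $f(\hat{x}_k) - f(x^*) \le O\left( \frac{L d(x^*)}{\sigma_d k^2} + M \left( \frac{d(x^*)}{\sigma_d} \right)^{\frac{\rho}{2}} k^{-\frac{3\rho-2}{2}} \right)$. Consequently, we obtain an upper bound of the iteration complexity to obtain $f(\hat{x}_k) - f(x^*) \le \varepsilon$ which is proportional to

$$\left( \frac{L d(x^*)}{\sigma_d \varepsilon} \right)^{\frac{1}{2}} + \left( \frac{d(x^*)}{\sigma_d} \right)^{\frac{\rho}{3\rho-2}} \left( \frac{M}{\varepsilon} \right)^{\frac{2}{3\rho-2}}.$$

In view of the lower complexity (13) (with $L$ replaced by $M$ there), it turns out that the order of the second term is optimal for the class $C^{1,\rho-1}_M(Q)$.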
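The choice of $\gamma$ above is a one-dimensional minimization and can be checked numerically. The following is a minimal sketch, assuming the identifications $a = 4d(x^*)/\sigma_d$, $b = M^{2/(2-\rho)}/(3\rho)$ and $p = \rho/(2-\rho)$ read off from the estimate of Theorem 5.5; the function names and the sample values in the usage are illustrative and not from the paper.

```python
def g(gamma, a, b, p):
    """g(gamma) = a gamma + b gamma^(-p) on (0, inf)."""
    return a * gamma + b * gamma ** (-p)


def gamma_star(a, b, p):
    """Minimizer (p b / a)^(1/(p+1))."""
    return (p * b / a) ** (1.0 / (p + 1.0))


def g_min(a, b, p):
    """Minimum value (p+1) p^(-p/(p+1)) a^(p/(p+1)) b^(1/(p+1))."""
    return ((p + 1.0) * p ** (-p / (p + 1.0))
            * a ** (p / (p + 1.0)) * b ** (1.0 / (p + 1.0)))


def gamma_star_weakly_smooth(M, sigma_d, d_star, rho):
    """Simplified form M (sigma_d / (12 (2 - rho) d(x*)))^((2 - rho)/2)."""
    return M * (sigma_d / (12.0 * (2.0 - rho) * d_star)) ** ((2.0 - rho) / 2.0)
```

Both expressions for $\gamma^*$ agree, and the minimum value matches the coefficient $\frac{2(2\sqrt{3})^{\rho}}{3\rho(2-\rho)^{(2-\rho)/2}} M (d(x^*)/\sigma_d)^{\rho/2}$ in the final estimate.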

5.4.3 Optimal convergence rate for the strongly convex case 

Now we show a convergence result of PGMs for the strongly convex case $\sigma_f > 0$.

Theorem 5.6. Consider a structured problem in the class SV{mf,crf,df,L,6). Assume addi¬ 
tionally that L(-) = L > 0, af > 0, and 6{y,x) = \ \y — x\\^ for p € [1,2), M(-) > 0. Let 

{(zfc_i, rcfc-i, Xfc, Xfc)}fc>o be generated by the modified method of Method \3.S\ with 


Afc := 


p+1 


{k + iy, /3, + + 


where p > 1 and fi > 0 with CTdcrj + pL + {p + ycrafi > 0. Then, for every k > 0, we have 
fixk)-fix*)+crf^{zk,x*) < +/3^ (p +l)^Zrf(zfc;x*)|^-i-^j^ 


+ 


+ 


(p+ 1)(2 - p)maxo<j<fc M(xi) 2 -p 1 

2p{addf + pL + (p + l)adl3)^ (fc + 1 )p+^ 

3P+i(2 - p) maxo<i<fcM(xj)^ f 2P~^{p+ l)2\ ^ 


2p 


O-dCTf 


P{k), 


where 


P{k) = < 


p+2-i^Y\k+ir-0 :p+i>e, 


2-pJ 

1 + log k 
{k + l)P+i 

l-(p+2-A 

{k + l)P+i 


I 3p—2 


2-p 

3p-2 

2-p 


^P+l< 


Proof. Note that fik is non-decreasing and < <S'fc < ■ Then, we 

have 




( 49 ) 


Since the inequalities > 


and 


SkSk- 


(fc+i)p-i ^ Yiy (fc+ip-i ^ 2 P-ig+i)^ for fc > I imply 




-Oik O'd [Of + 


Ski.fik—1 T 'S'fc—l^Xj) 


Ai 


) - T > addf + fiad + ^p-i^pi i)2 > Q’ ^ ^ 
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we obtain 


Sk 


< 


(P+ 1 )H ^d(7f 


2P-\p + l)^\^-p {k + 2)P+^ ^ 3P+^ f2P-^{p + iy\^" 


A ^(p + 1 )^ 


for all A: > 1. Combining with 


■S'q _ 

p 


1 ^ 




< 


(-00)^ (p+l)(o-dO-/+pL+(p+l)o-d/ 3 )^ 

P+l 1 


(TdCTf 

yields that 




'S'fc -^Q {-ai) 2-p {addf +pL + {p+ l)adP) 2 -p 
where the factor P{k) is due to the following inequality: 


+ 


3 P +1 + 

\ CTdCTf ) 


( 50 ) 


i”? < < 1 + log /c 


Z =1 


1 - 


1+9 


q > -1, 

g = - 1 , 

q<-\. 


Consequently, the assertion follows from Lemma 15.41 with the inequalities (I49p and (1501) . 


□ 


Notice that we do not need $\rho$ and $M(\cdot)$ in the definition of the parameters $\lambda_k, \beta_k$; the result holds for all acceptable $\rho \in [1, 2)$. If we further have $p + 1 > \frac{3\rho-2}{2-\rho}$, then $P(k)$ has the best rate of convergence for a fixed $\rho$. Now let us see the above upper bound in the case $L = 0$, $\bar\sigma_f = \sigma_f > 0$, $M(\cdot) \equiv M$, $\beta = 0$, $p + 1 > \frac{3\rho-2}{2-\rho}$:


f{xk)-f{x*) + afC{zk,x*) < (p+i)(2 p)A^ (fc+i^+l 

2p{aiuj)^P ^ ^ 






-1 


(fc + 1 )- 


3p-2 

■ 2-p , 


Since this bound is of O {c{p, p) p^^pp/( 2 -p) k '^-p ^ for a continuous function c{p,p), it achieves 
the optimal complexity (1131) for the strongly convex case. In contrast to the optimal method in 
m, we do not need to restart the method and do not require M and an upper bound of d{x*) in 
advanc^ to ensure the optimality. 

Let us consider the non-smooth case p = l, df = aj > 0. Then, taking p = 1 and /3 = 0 yields 
Xk = {k + l)/2, /3fc_i = L/ud, and 

... . , tr / 4L/rf(zfc;x*) m.axo<i<k M 18maxo<i<fc 

/M -/(.) + + + ■ 

This result is similar to the ones [181 Proposition 9] and [U Corollary 2] in the deterministic case. 


5.4.4 Optimal/nearly optimal convergence rate of conditional gradient methods 

We finally consider the case of conditional gradient methods: $\beta_k \equiv 0$, $\sigma_f = \bar\sigma_f = 0$. This case can be analyzed without Lemma 5.4.

®As is indicated in m. an obvious upper bound of d{x*) can be obtained if V f{x*) = 0 and we know M for 
the weakly smooth problems (example (iv) in Section l2.3.1|l in the Euclidean setting d{x) = ^||x — xoHi : The 
inequality d{x*) < ^ follows since we have ^\\x* — loHi < /(®o) — fix*) < "yll^o — x*\\2 (recall the 

strong convexity and ®)- 
Theorem 5.7. Consider a structured problem in the class $\mathcal{SP}(m_f, \sigma_f, \bar\sigma_f, L, \delta)$. Assume additionally that $L(\cdot) \equiv L \ge 0$, $\sigma_f = \bar\sigma_f = 0$, and $\delta(y, x) = \frac{M}{\rho}\|y - x\|^{\rho}$ for $\rho \in [1, 2)$, $M \ge 0$. Then, the modified method of Method 3.2 for the problem with $\lambda_k := (k+1)/2$ and $\beta_k := 0$ generates a sequence $\{\hat{x}_k\}_{k \ge 0} \subset Q$ satisfying

$$f(\hat{x}_k) - f(x^*) \le \frac{2L\, \mathrm{Diam}(Q)^2}{k+4} + \frac{2^{\rho+1} M\, \mathrm{Diam}(Q)^{\rho}}{\rho (3-\rho)(k+2)^{\rho-1}} \tag{51}$$

for every $k \ge 0$.

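Proof. Theorem 4.2 yields that $f(\hat{x}_k) - f(x^*) \le C_k / S_k$ with $S_k = (k+1)(k+2)/4$ and

$$C_k = \sum_{i=0}^{k} \left[ \frac{L \lambda_i^2}{2 S_i}\|w_i - z_{i-1}\|^2 + \frac{M}{\rho} \cdot \frac{\lambda_i^{\rho}}{S_i^{\rho-1}}\|w_i - z_{i-1}\|^{\rho} \right]$$

(see Remark 4.3). Using the inequality (41) and

$$\sum_{i=0}^{k} \frac{\lambda_i^{\rho}}{S_i^{\rho-1}} = \frac{1}{2^{2-\rho}}\sum_{i=0}^{k} \frac{i+1}{(i+2)^{\rho-1}} \le \frac{1}{2^{2-\rho}}\sum_{i=0}^{k} (i+1)^{2-\rho} \le \frac{1}{2^{2-\rho}(3-\rho)}(k+2)^{3-\rho}$$

(the first and the second inequalities are due to $i + 1 \le i + 2$ and the fact $\sum_{i=0}^{k} (i+1)^q \le \frac{1}{q+1}(k+2)^{q+1}$ for $q \ge 0$, respectively), we conclude that

$$f(\hat{x}_k) - f(x^*) \le \frac{C_k}{S_k} \le \frac{2L\, \mathrm{Diam}(Q)^2}{k+4} + \frac{2^{\rho} M\, \mathrm{Diam}(Q)^{\rho}}{\rho(3-\rho)} \cdot \frac{(k+2)^{2-\rho}}{k+1}.$$

The estimate (51) now follows from $\frac{k+2}{k+1} \le 2$ for $k \ge 0$. □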

The bound (f5T]) is also valid for the classical CGM (fT5]) with Tk ■= Xk+i/Sk+i = Xk '■= Xk\ 
it can be derived in the same way as Theorem 15.71 based on the estimate (1271) since /(xq) — 

© ifT^ \p 

mfixQ;zo) < ■|Diam(Q)2 + ^Diam((5)^ and (5(xfc_i,Xfc) = yllxfc - Xfc-ill^ = ^-^\\xk-i - 

Zk-lW^ < ^^Diam((5)^ for > 1. This result in the case L = 0 is very similar to a known result 
for the classical CGM (see [9l Proposition 1.1] and [40]b 

Since the choice Xk = [k + l)/2 and = 0 are independent of L,M, and p, the conditional 
gradient methods can be applied to the classes Clf~^{Q), p G (1,2] ensuring the convergence 

/(xfe) — f{x*) < O ^ • Thus, our CGMs ensure the same convergence rate as the known 

one ()39p of existing CGMs for weakly smooth problems. 

When we choose the extended MD model (19) or the DA model (20) in Theorem 5.7, the obtained CGMs match particular cases of Lan's CGMs mentioned in Section 2.3.2. Since the
convergence rates for ban’s CGMs was analyzed only for smooth problems in m. our result 
provides a generalization of them for weakly smooth problems. 


6 Conclusion 

This paper proposes a new framework for applying (sub)gradient-based methods to minimize strongly convex functions. It unifies the analysis of PGMs and CGMs for several classes of problems including non-smooth, smooth, and weakly smooth problems. We have introduced the notion of strong convexity with respect to the prox-function, which generalizes the one in the Euclidean setting. The proposed PGMs establish optimal convergence rates for these problems with slight improvements over some existing methods. Furthermore, particular cases of the framework yield a family of variations of the classical CGM with optimal and nearly optimal guarantees of convergence in the non-strongly convex case.

A remarkable novel result in this paper, in view of method efficiency, is the achievement of the 
optimal complexity for the weakly smooth problems (the class (Q), v ^ [0,1)) in the strongly 
convex case without knowing the constant M and an upper bound of d{x*) (Section 15.4.31 see 
also Section [2.3.11 (ivl for remarks on the literature). The theoretical approach for that is similar 
to the ones in [IIIIIZIEH] because the structure ([6]) assumes an oracle inexactness of the original 
problem. Furthermore, the analysis of Sections 15.4.21 and 15.4.31 can be seen as a generalization of 
the techniques of [laiiH] in the deterministic case. 

We finally describe several topics for further considerations. At first, we can consider a general¬ 
ization/combination of the (sub)gradient-based methods here with smoothing techniques, stochas¬ 
tic situations, or uniformly convex settings. Related studies can be seen in dg EH [251127]. Secondly, 
one can further consider tuning the parameters, the weight and the scaling ones, to obtain an efficient convergence. The proposed choices in Section 5 are not the only way to ensure the optimal convergence; see, e.g., [16, 29] for some discussions on other choices. Thirdly, it is important to note that the convergence results for the class $C^{1,\nu}_M(Q)$ in Sections 5.4.2 and 5.4.3 are not adaptive in contrast to the known method [39] proposed by Nesterov; namely, they do not ensure the optimal convergence without knowing the parameter $\nu$. From the practical viewpoint, it will be important to develop techniques to ensure efficient convergence rates without such problem specific information.
A Appendix

In order to complete the proof of Theorem 5.3, we need to obtain upper bounds for $1/S_k$ and $\sum_{i=0}^{k} S_i / S_k$ for the sequence $\{S_k\}_{k \ge 0}$ defined by (46). Since $\lambda_{k+1} = S_{k+1} - S_k$, writing $r := \frac{\sigma_d \sigma_f}{L - \sigma_d \bar\sigma_f}$, the sequence $\{S_k\}_{k \ge 0}$ in (46) is determined by the recurrence

$$S_0 = 1, \quad (S_{k+1} - S_k)^2 = S_{k+1}(1 + r S_k), \quad k \ge 0 \tag{52}$$

where the root of the equation in $S_{k+1}$ takes the largest one, namely,

$$S_{k+1} = \frac{1 + (2+r)S_k + \sqrt{(1 + (2+r)S_k)^2 - 4S_k^2}}{2}. \tag{53}$$

The essentials of lemmas below are the same as [m Lemma 4-7] excepting the replacement of /r/L 
in the article by an arbitrary r > 0 . 
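The recurrence (53) is straightforward to iterate, which gives a quick numerical check of the two lower bounds on $S_k$ stated in Lemma A.1 below, namely $S_k \ge (k+1)(k+4)/4$ and $S_k \ge \big( (2 + r + \sqrt{r^2 + 4r})/2 \big)^k$. The following is a minimal sketch in floating-point arithmetic; the function names and tolerances are illustrative choices.

```python
import math


def next_S(S, r):
    """Largest root of (S_next - S)^2 = S_next (1 + r S), as in (53)."""
    t = 1.0 + (2.0 + r) * S
    return 0.5 * (t + math.sqrt(t * t - 4.0 * S * S))


def sequence_S(r, n):
    """Return [S_0, ..., S_n] with S_0 = 1."""
    seq = [1.0]
    for _ in range(n):
        seq.append(next_S(seq[-1], r))
    return seq


def check_lower_bounds(r, n):
    """Check the sublinear and linear lower bounds on S_k for k <= n."""
    q = 0.5 * (2.0 + r + math.sqrt(r * r + 4.0 * r))
    ok = True
    for k, S in enumerate(sequence_S(r, n)):
        ok = ok and S >= (k + 1) * (k + 4) / 4.0 * (1.0 - 1e-12)
        ok = ok and S >= q ** k * (1.0 - 1e-9)
    return ok
```

For $r = 0$ the growth is only quadratic in $k$, while any $r > 0$ yields geometric growth, which is the source of the linear convergence factor in Theorem 5.3 (i).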


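Lemma A.1. For any sequence $\{S_k\}_{k \ge 0}$ defined by (52) for $r \ge 0$, we have

$$\frac{1}{S_k} \le \min\left\{ \frac{4}{(k+1)(k+4)},\ \left( \frac{2}{2 + r + \sqrt{r^2 + 4r}} \right)^{k} \right\}, \quad \forall k \ge 0.$$

Proof. Since $S_{k+1} \ge S_k$, we have

$$\sqrt{S_{k+1}} - \sqrt{S_k} = \frac{S_{k+1} - S_k}{\sqrt{S_{k+1}} + \sqrt{S_k}} \ge \frac{S_{k+1} - S_k}{2\sqrt{S_{k+1}}} = \frac{1}{2}\sqrt{1 + r S_k} \ge \frac{1}{2} \tag{54}$$

which shows $\sqrt{S_k} \ge \frac{k}{2} + \sqrt{S_0} = \frac{k+2}{2}$ for all $k \ge 0$. Then, we have

$$S_k - S_0 = \sum_{i=0}^{k-1}(S_{i+1} - S_i) = \sum_{i=0}^{k-1}\sqrt{S_{i+1}(1 + r S_i)} \ge \sum_{i=0}^{k-1}\sqrt{S_{i+1}} \ge \sum_{i=0}^{k-1}\frac{i+3}{2} = \frac{k(k+5)}{4},$$

which gives $S_k \ge S_0 + \frac{k(k+5)}{4} = \frac{(k+1)(k+4)}{4}$. On the other hand, using (53) yields that

$$\frac{S_{k+1}}{S_k} = \frac{\frac{1}{S_k} + 2 + r + \sqrt{\big( \frac{1}{S_k} + 2 + r \big)^2 - 4}}{2} \ge \frac{2 + r + \sqrt{(2+r)^2 - 4}}{2} = \frac{2 + r + \sqrt{r^2 + 4r}}{2}$$

for all $k \ge 0$. Hence, we have $S_k \ge S_0 \left( \frac{2 + r + \sqrt{r^2+4r}}{2} \right)^k = \left( \frac{2 + r + \sqrt{r^2+4r}}{2} \right)^k$. □

Remark. The linear convergence factor in the above lemma satisfies

$$1 - \sqrt{\frac{r}{r+1}} \le \frac{2}{2 + r + \sqrt{r^2 + 4r}} \le \left( 1 + \frac{1}{2}\sqrt{r} \right)^{-2}. \tag{55}$$

In fact, since

$$\left( 1 - \sqrt{\frac{r}{r+1}} \right)^{-1} = \frac{\sqrt{r+1}}{\sqrt{r+1} - \sqrt{r}} = \sqrt{r+1}\,(\sqrt{r+1} + \sqrt{r}) = \frac{2 + 2r + \sqrt{4r^2 + 4r}}{2},$$

we obtain

$$\left( 1 + \frac{1}{2}\sqrt{r} \right)^2 = \frac{2 + r/2 + 2\sqrt{r}}{2} \le \frac{2 + r + \sqrt{r^2 + 4r}}{2} \le \frac{2 + 2r + \sqrt{4r^2 + 4r}}{2} = \left( 1 - \sqrt{\frac{r}{r+1}} \right)^{-1}.$$

Note that if $\bar\sigma_f = \sigma_f$ and $r = \frac{\sigma_d \sigma_f}{L - \sigma_d \sigma_f}$, then $\sqrt{\frac{r}{r+1}} = \sqrt{\frac{\sigma_d \sigma_f}{L}}$.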

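Lemma A.2. The sequence $\{S_k\}_{k \ge 0}$ defined by (52) for $r > 0$ satisfies

$$\frac{\sum_{i=0}^{k} S_i}{S_k} \le \frac{1 + \sqrt{1 + 4r^{-1}}}{2} \le 1 + \frac{1}{\sqrt{r}}, \quad \forall k \ge 0.$$

Proof. Notice that $\gamma := \frac{1 + \sqrt{1 + 4r^{-1}}}{2}$ satisfies

$$\left( 1 - \frac{1}{\gamma} \right)^{-1} = \frac{\gamma}{\gamma - 1} = \frac{\sqrt{1 + 4r^{-1}} + 1}{\sqrt{1 + 4r^{-1}} - 1} = \frac{(\sqrt{1 + 4r^{-1}} + 1)^2}{4r^{-1}} = \frac{2 + r + \sqrt{r^2 + 4r}}{2}.$$

Therefore, we obtain $\frac{S_k}{S_{k+1}} \le 1 - \frac{1}{\gamma}$ by (53). Now the result follows by induction: If $\sum_{i=0}^{k} S_i / S_k \le \gamma$ holds for some $k \ge 0$, we have

$$\frac{\sum_{i=0}^{k+1} S_i}{S_{k+1}} = 1 + \frac{S_k}{S_{k+1}} \cdot \frac{\sum_{i=0}^{k} S_i}{S_k} \le 1 + \left( 1 - \frac{1}{\gamma} \right)\gamma = \gamma.$$

This proves the first inequality; the second can be verified from $\sqrt{1 + 4r^{-1}} \le 1 + \frac{2}{\sqrt{r}}$. □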


Note 


l+v'l+4r-l 
2 


that the result of Lemma IA.2I is the same as m Lemma 5] because 1 + 


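Lemma A.3. Let $\{S_k\}_{k \ge 0}$ be defined as in Lemma A.2 and $\{T_k\}_{k \ge 0}$ be defined by (52) with $r := 0$, namely $T_0 := 1$ and $T_{k+1} := \frac{1 + 2T_k + \sqrt{1 + 4T_k}}{2}$, $k \ge 0$. Then, we have

$$\frac{\sum_{i=0}^{k} S_i}{S_k} \le \frac{\sum_{i=0}^{k} T_i}{T_k}, \quad \forall k \ge 0.$$

Proof. Due to the identity

$$\frac{\sum_{i=0}^{k} S_i}{S_k} = 1 + \sum_{i=0}^{k-1}\frac{S_i}{S_k} = 1 + \sum_{i=0}^{k-1}\prod_{j=i}^{k-1}\frac{S_j}{S_{j+1}}, \quad k \ge 0,$$

it is enough to show that $\frac{S_k}{S_{k+1}} \le \frac{T_k}{T_{k+1}}$ for every $k \ge 0$. Notice that we have

$$\frac{S_{k+1}}{S_k} = \frac{\frac{1 + r S_k}{S_k} + 2 + \sqrt{\big( \frac{1 + r S_k}{S_k} + 2 \big)^2 - 4}}{2}, \qquad \frac{T_{k+1}}{T_k} = \frac{\frac{1}{T_k} + 2 + \sqrt{\big( \frac{1}{T_k} + 2 \big)^2 - 4}}{2}, \tag{56}$$

which suggests us to prove $\frac{1 + r S_k}{S_k} \ge \frac{1}{T_k}$ for $k \ge 0$. It is true for $k = 0$ by $S_0 = T_0$. If it holds for $k \ge 0$, then, writing $\alpha := \frac{1 + r S_k}{S_k}$, $\beta := \frac{1}{T_k}$, we obtain

$$\frac{1 + r S_{k+1}}{S_{k+1}} \ge \frac{1 + r S_k}{S_{k+1}} = \frac{S_k}{S_{k+1}}\alpha = \frac{2\alpha}{\alpha + 2 + \sqrt{(\alpha+2)^2 - 4}} \ge \frac{2\beta}{\beta + 2 + \sqrt{(\beta+2)^2 - 4}} = \frac{T_k}{T_{k+1}} \cdot \frac{1}{T_k} = \frac{1}{T_{k+1}}$$

since $S_{k+1} \ge S_k$ and $x \mapsto \frac{2x}{x + 2 + \sqrt{(x+2)^2 - 4}} = \frac{2}{1 + 2/x + \sqrt{1 + 4/x}}$ is non-decreasing on $(0, \infty)$. Hence, we claim $\frac{1 + r S_k}{S_k} \ge \frac{1}{T_k}$ for all $k \ge 0$ and therefore the proof is completed. □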

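Lemma A.4. Let $\{T_k\}_{k \ge 0}$ be a sequence defined by (52) with $r := 0$, namely $T_0 := 1$ and $T_{k+1} := \frac{1 + 2T_k + \sqrt{1 + 4T_k}}{2}$ for $k \ge 0$. Then, we have

$$\frac{\sum_{i=0}^{k} T_i}{T_k} \le \frac{1}{3}k + \frac{1}{6}\log(k+2) + 1, \quad \forall k \ge 0.$$

Proof. The case $k = 0$ is obvious. Assume that the assertion is true for some $k \ge 0$. Putting $U_k := \frac{1}{3}k + \frac{1}{6}\log(k+2) + 1$, we have

$$\frac{\sum_{i=0}^{k+1} T_i}{T_{k+1}} = 1 + \frac{T_k}{T_{k+1}} \cdot \frac{\sum_{i=0}^{k} T_i}{T_k} \le 1 + \frac{T_k}{T_{k+1}} U_k.$$

Hence, it remains to show $1 + \frac{T_k}{T_{k+1}} U_k \le U_{k+1}$ for $k \ge 0$. For that, we analyze the sequence $t_0 := 1$, $t_{k+1} := T_{k+1} - T_k$ for $k \ge 0$ (namely, $T_k = \sum_{i=0}^{k} t_i$). The recurrence relation of $T_k$ implies $t_k^2 = (T_k - T_{k-1})^2 = T_k$ and

$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}.$$

Analyzing the difference $t_{k+1} - t_k$ shows for $k \ge 0$ that

$$t_{k+1} - t_k = \frac{1 + \sqrt{1 + 4t_k^2} - 2t_k}{2} = \frac{1}{2} + \frac{1}{2\big( \sqrt{1 + 4t_k^2} + 2t_k \big)} \le \frac{1}{2} + \frac{1}{8 t_k}.$$

Since Lemma A.1 yields $t_k = \sqrt{T_k} \ge \sqrt{(k+1)(k+4)/4} \ge (k+2)/2$ for $k \ge 0$, we obtain

$$t_{k+1} \le t_0 + \frac{k+1}{2} + \frac{1}{8}\sum_{i=0}^{k}\frac{1}{t_i} \le \frac{k}{2} + \frac{3}{2} + \frac{1}{8}\sum_{i=0}^{k}\frac{2}{i+2} \le \frac{k}{2} + \frac{3}{2} + \frac{1}{4}\log(k+2) = \frac{3}{2}U_k$$

for all $k \ge 0$. Finally, this upper bound of $t_k$ concludes that

$$\frac{U_k}{1 + U_k - U_{k+1}} = \frac{U_k}{\frac{2}{3} + \frac{1}{6}\log\frac{k+2}{k+3}} \ge \frac{3}{2}U_k \ge t_{k+1} = \frac{t_{k+1}^2}{t_{k+1}} = \frac{T_{k+1}}{T_{k+1} - T_k}.$$

Taking the inverse and multiplying by $U_k$ for both sides yield $1 + \frac{T_k}{T_{k+1}} U_k \le U_{k+1}$. □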
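The bound of Lemma A.4 can also be observed numerically by iterating the recurrence for $T_k$. The following is a minimal sketch in floating-point arithmetic; the function names are illustrative, and the bound $\frac{1}{3}k + \frac{1}{6}\log(k+2) + 1$ is the one stated in the lemma.

```python
import math


def sequence_T(n):
    """Return [T_0, ..., T_n] with T_0 = 1, T_{k+1} = (1 + 2 T_k + sqrt(1 + 4 T_k)) / 2."""
    seq = [1.0]
    for _ in range(n):
        T = seq[-1]
        seq.append(0.5 * (1.0 + 2.0 * T + math.sqrt(1.0 + 4.0 * T)))
    return seq


def ratio_and_bound(n):
    """Return pairs (sum_{i<=k} T_i / T_k, U_k) for k = 0, ..., n."""
    rows = []
    running = 0.0
    for k, T in enumerate(sequence_T(n)):
        running += T
        rows.append((running / T, k / 3.0 + math.log(k + 2) / 6.0 + 1.0))
    return rows
```

The ratio grows like $k/3$, so the leading term of the bound cannot be improved; only the logarithmic and constant terms carry slack.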














































